Preface to the First Edition


Across  all  sciences,  a  quantitative  analysis  of  data  is  necessary  to  assess  the 
significance  of  experiments,  observations,  and  calculations.  This  book  was  written 
over  a  period  of  10  years,  as  I  developed  an  introductory  graduate  course  on 
statistics  and  data  analysis  at  the  University  of  Alabama  in  Huntsville.  My  goal 
was  to  put  together  the  material  that  a  student  needs  for  the  analysis  and  statistical 
interpretation  of  data,  including  an  extensive  set  of  applications  and  problems  that 
illustrate  the  practice  of  statistical  data  analysis. 

The  literature  offers  a  variety  of  books  on  statistical  methods  and  probability 
theory.  Some  are  primarily  on  the  mathematical  foundations  of  statistics,  some 
are  purely  on  the  theory  of  probability,  and  others  focus  on  advanced  statistical 
methods  for  specific  sciences.  This  textbook  contains  the  foundations  of  probability, 
statistics,  and  data  analysis  methods  that  are  applicable  to  a  variety  of  fields — 
from  astronomy  to  biology,  business  sciences,  chemistry,  engineering,  physics,  and 
more — with  equal  emphasis  on  mathematics  and  applications.  The  book  is  therefore 
not  specific  to  a  given  discipline,  nor  does  it  attempt  to  describe  every  possible 
statistical  method.  Instead,  it  focuses  on  the  fundamental  methods  that  are  used 
across  the  sciences  and  that  are  at  the  basis  of  more  specific  techniques  that  can 
be  found  in  more  specialized  textbooks  or  research  articles. 

This textbook covers probability theory and random variables, maximum-likelihood
methods for single variables and two-variable datasets, and more complex
topics  of  data  fitting,  estimation  of  parameters,  and  confidence  intervals.  Among  the 
topics  that  have  recently  become  mainstream,  Monte  Carlo  Markov  chains  occupy 
a  special  role.  The  last  chapter  of  the  book  provides  a  comprehensive  overview  of 
Markov  chains  and  Monte  Carlo  Markov  chains,  from  theory  to  implementation. 

I  believe  that  a  description  of  the  mathematical  properties  of  statistical  tests  is 
necessary to understand their applicability. This book therefore contains mathematical
derivations that I considered particularly useful for a thorough understanding of
the  subject;  the  book  refers  the  reader  to  other  sources  in  case  of  mathematics  that 
goes  beyond  that  of  basic  calculus.  The  reader  who  is  not  familiar  with  calculus  may 
skip  those  derivations  and  continue  with  the  applications. 
Nonetheless, statistics is necessarily slanted toward applications. To highlight
the  relevance  of  the  statistical  methods  described,  I  have  reported  original  data 
from  four  fundamental  scientific  experiments  from  the  past  two  centuries:  J.J. 
Thomson’s  experiment  that  led  to  the  discovery  of  the  electron,  G.  Mendel’s  data 
on  plant  characteristics  that  led  to  the  law  of  independent  assortment  of  species, 
E.  Hubble’s  observation  of  nebulae  that  uncovered  the  expansion  of  the  universe, 
and  K.  Pearson’s  collection  of  biometric  characteristics  in  the  UK  in  the  early 
twentieth  century.  These  experiments  are  used  throughout  the  book  to  illustrate  how 
statistical  methods  are  applied  to  actual  data  and  are  used  in  several  end-of-chapter 
problems.  The  reader  will  therefore  have  an  opportunity  to  see  statistics  in  action 
on  these  classic  experiments  and  several  additional  examples. 

The  material  presented  in  this  book  is  aimed  at  upper-level  undergraduate 
students  or  beginning  graduate  students.  The  reader  is  expected  to  be  familiar 
with  basic  calculus,  and  no  prior  knowledge  of  statistics  or  probability  is  assumed. 
Professional  scientists  and  researchers  will  find  it  a  useful  reference  for  fundamental 
methods  such  as  maximum-likelihood  fit,  error  propagation  formulas,  goodness  of 
fit  and  model  comparison,  Monte  Carlo  methods  such  as  the  jackknife  and  bootstrap, 
Monte Carlo Markov chains, Kolmogorov-Smirnov tests, and more. All subjects
are  complemented  by  an  extensive  set  of  numerical  tables  that  make  the  book 
completely  self-contained. 

The  material  presented  in  this  book  can  be  comfortably  covered  in  a  one-semester 
course  and  has  several  problems  at  the  end  of  each  chapter  that  are  suitable  as 
homework assignments or exam questions. Problems are of both a theoretical and a
numerical nature, so that emphasis is equally placed on conceptual and practical
understanding  of  the  subject.  Several  datasets,  including  those  in  the  four  “classic 
experiments,”  are  used  across  several  chapters,  and  the  students  can  therefore  use 
them  in  applications  of  increasing  difficulty. 


Huntsville,  AL,  USA 


Massimiliano  Bonamente 


Preface  to  the  Second  Edition 


The  second  edition  of  Statistics  and  Analysis  of  Scientific  Data  was  motivated  by 
the  overall  goal  to  provide  a  textbook  that  is  mathematically  rigorous  and  easy  to 
read  and  use  as  a  reference  at  the  same  time.  Basically,  it  is  a  book  for  both  the 
student  who  wants  to  learn  in  detail  the  mathematical  underpinnings  of  statistics 
and  the  reader  who  wants  to  just  find  the  practical  description  on  how  to  apply  a 
given  statistical  method  or  use  the  book  as  a  reference. 

To this end, first I decided that a clearer demarcation between theoretical and
practical topics would improve the readability of the book. As a result, several pages
(i.e., mathematical derivations) are now clearly marked throughout the book with a
vertical line, to indicate material that is primarily aimed at those readers who seek
a  more  thorough  mathematical  understanding.  Those  parts  are  not  required  to  learn 
how  to  apply  the  statistical  methods  presented  in  the  book.  For  the  reader  who  uses 
this  book  as  a  reference,  this  makes  it  easy  to  skip  such  sections  and  go  directly 
to  the  main  results.  At  the  end  of  each  chapter,  I  also  provide  a  summary  of  key 
concepts, intended for a quick look-up of the results of each chapter.

Secondly, certain existing material needed substantial re-organization and expansion.
The second edition is now comprised of 16 chapters, versus ten of the first
edition. A few chapters (Chap. 6 on mean, median, and averages, Chap. 9 on multi-variable
regression, and Chap. 11 on systematic errors and intrinsic scatter) contain
material that is substantially new. In particular, the topic of multi-variable regression
was  introduced  because  of  its  use  in  many  fields  such  as  business  and  economics, 
where  it  is  common  to  apply  the  regression  method  to  many  independent  variables. 
Other  chapters  originate  from  re-arranging  existing  material  more  effectively.  Some 
of  the  numerical  tables  in  both  the  main  body  and  the  appendix  have  been  expanded 
and  re-arranged,  so  that  the  reader  will  find  it  even  easier  to  use  them  for  a  variety 
of  applications  and  as  a  reference. 

The second edition also contains a new classic experiment, that of the measurement
of iris characteristics by R.A. Fisher and E. Anderson. These new data are used
to  illustrate  primarily  the  method  of  regression  with  many  independent  variables. 
The  textbook  now  features  a  total  of  five  classic  experiments  (including  G.  Mendel’s 
data  on  the  independent  assortment  of  species,  J.J.  Thomson’s  data  on  the  discovery 
of the electron, K. Pearson’s collection of data of biometric characteristics, and
E.  Hubble’s  measurements  of  the  expansion  of  the  universe).  These  data  and  their 
analysis  provide  a  unique  way  to  learn  the  statistical  methods  presented  in  the  book 
and  a  resource  for  the  student  and  the  teacher  alike.  Many  of  the  end-of-chapter 
problems  are  based  on  these  experimental  data. 

Finally,  the  new  edition  contains  corrections  to  a  number  of  typos  that  had 
inadvertently  entered  the  manuscript.  I  am  very  much  in  debt  to  many  of  my  students 
at  the  University  of  Alabama  in  Huntsville  for  pointing  out  these  typos  to  me  over  the 
past  few  years,  in  particular,  to  Zachary  Robinson,  who  has  patiently  gone  through 
much  of  the  text  to  find  typographical  errors. 

Huntsville,  AL,  USA  Massimiliano  Bonamente 

Chapter 1

Theory  of  Probability 


Abstract  The  theory  of  probability  is  the  mathematical  framework  for  the  study 
of  the  probability  of  occurrence  of  events.  The  first  step  is  to  establish  a  method 
to  assign  the  probability  of  an  event,  for  example,  the  probability  that  a  coin  lands 
heads up after a toss. The frequentist (or empirical) approach and the subjective
(or Bayesian) approach are two methods that can be used to calculate probabilities.
The fact that there is more than one method available for this purpose should not
be viewed as a limitation of the theory, but rather as a reflection of the fact that for certain parts
of the theory of probability, and even more so for statistics, there is an element
of subjectivity that enters the analysis and the interpretation of the results. It is
therefore  the  task  of  the  statistician  to  keep  track  of  any  assumptions  made  in  the 
analysis,  and  to  account  for  them  in  the  interpretation  of  the  results.  Once  a  method 
for  assigning  probabilities  is  established,  the  Kolmogorov  axioms  are  introduced 
as  the  “rules”  required  to  manipulate  probabilities.  Fundamental  results  known  as 
Bayes’  theorem  and  the  theorem  of  total  probability  are  used  to  define  and  interpret 
the  concepts  of  statistical  independence  and  of  conditional  probability,  which  play 
a  central  role  in  much  of  the  material  presented  in  this  book. 


1.1  Experiments,  Events,  and  the  Sample  Space 

Every  experiment  has  a  number  of  possible  outcomes.  For  example,  the  experiment 
consisting  of  the  roll  of  a  die  can  have  six  possible  outcomes,  according  to  the 
number that shows after the die lands. The sample space Ω is defined as the set of
all possible outcomes of the experiment, in this case Ω = {1, 2, 3, 4, 5, 6}. An event
A is a subset of Ω, A ⊂ Ω, and it represents a number of possible outcomes for the
experiment. For example, the event “even number” is represented by A = {2, 4, 6},
and the event “odd number” by B = {1, 3, 5}. For each experiment, two events
always exist: the sample space itself, Ω, comprising all possible outcomes, and
A = ∅, called the impossible event, or the event that contains no possible outcome.


Events are conveniently studied using set theory, and the following definitions
are very common in the theory of probability:

• The complementary Ā of an event A is the set of all possible outcomes except
those in A. For example, the complementary of the event “odd number” is the
event “even number.”

• Given two events A and B, the union C = A ∪ B is the event comprising all
outcomes of A and those of B. In the roll of a die, the union of odd and even
numbers is the sample space itself, consisting of all possible outcomes.

• The intersection of two events C = A ∩ B is the event comprising all outcomes
of A that are also outcomes of B. When A ∩ B = ∅, the events are said to be
mutually exclusive. The union and intersection can be naturally extended to more
than two events.

• A number of events A_i are said to be a partition of the sample space if they are
mutually exclusive, and if their union is the sample space itself, ∪ A_i = Ω.

• When all outcomes in A are contained in B, we will say that A ⊂ B or B ⊃ A.
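
The set definitions above map directly onto the set type of most programming languages. The following is a minimal sketch in Python for the roll of a die; the helper names `complement` and `is_partition` are illustrative choices, not notation from the text.

```python
# Sample space for one roll of a six-sided die.
omega = frozenset({1, 2, 3, 4, 5, 6})

even = frozenset({2, 4, 6})   # event "even number"
odd = frozenset({1, 3, 5})    # event "odd number"


def complement(event, space=omega):
    """All outcomes of the sample space that are not in the event."""
    return space - event


def is_partition(events, space=omega):
    """True if the events are pairwise mutually exclusive and their union is the space."""
    events = list(events)
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            if a & b:  # non-empty intersection: not mutually exclusive
                return False
    return frozenset().union(*events) == space


print(complement(odd) == even)      # the complementary of "odd" is "even"
print((odd | even) == omega)        # the union is the sample space
print((odd & even) == frozenset())  # empty intersection: mutually exclusive
print(is_partition([odd, even]))    # "odd" and "even" partition the sample space
print(frozenset({2}) <= even)       # {2} is contained in "even"
```

Here `|`, `&`, `-` and `<=` play the roles of union, intersection, set difference and inclusion.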


1.2  Probability  of  Events 

The probability P of an event describes the odds of occurrence of an event in a
single trial of the experiment. The probability is a number between 0 and 1, where
P = 0 corresponds to an impossible event, and P = 1 to a certain event. Therefore
the operation of “probability” can be thought of as a function that transforms each
possible event into a real number between 0 and 1.


1.2.1  The  Kolmogorov  Axioms 

The  first  step  to  determine  the  probability  of  the  events  associated  with  a  given 
experiment  is  to  establish  a  number  of  basic  rules  that  capture  the  meaning  of 
probability.  The  probability  of  an  event  is  required  to  satisfy  the  three  axioms 
defined by Kolmogorov [26]:

1. The probability of an event A is a non-negative number, P(A) ≥ 0;

2. The probability of all possible outcomes, or sample space, is normalized to the
value of unity, P(Ω) = 1;

3. If A ⊂ Ω and B ⊂ Ω are mutually exclusive events, then

P(A ∪ B) = P(A) + P(B).  (1.1)

Figure 1.1 illustrates this property using set diagrams. For events that are not
mutually exclusive, this property does not apply. The probability of the union is
represented by the area of A ∪ B, and the outcomes that overlap both events are
not double-counted.

Fig. 1.1 The probability P(A ∪ B) of the union of two events is the sum of the two individual probabilities only
if the two events are mutually exclusive. This property enables the interpretation of probability as
the “area” of a given event within the sample space

These  axioms  should  be  regarded  as  the  basic  “ground  rules”  of  probability,  but 
they  provide  no  unique  specification  on  how  event  probabilities  should  be  assigned. 
Two  major  avenues  are  available  for  the  assignment  of  probabilities.  One  is  based  on 
the  repetition  of  the  experiments  a  large  number  of  times  under  the  same  conditions, 
and  goes  under  the  name  of  the  frequentist  or  classical  method.  The  other  is  based 
on  a  more  theoretical  knowledge  of  the  experiment,  but  without  the  experimental 
requirement,  and  is  referred  to  as  the  Bayesian  approach. 


1.2.2  Frequentist  or  Classical  Method 

Consider performing an experiment a number N ≫ 1 of times, under the same
experimental conditions, and measuring the occurrence of the event A as the number
N(A). The probability of event A is given by

P(A) = lim_{N→∞} N(A)/N;  (1.2)

that  is,  the  probability  is  the  relative  frequency  of  occurrence  of  a  given  event  from 
many  repetitions  of  the  same  experiment.  The  obvious  limitation  of  this  definition 
is  the  need  to  perform  the  experiment  an  infinite  number  of  times,  which  is  not  only 
time  consuming,  but  also  requires  the  experiment  to  be  repeatable  in  the  first  place, 
which  may  or  may  not  be  possible. 

The  limitation  of  this  method  is  evident  by  considering  a  coin  toss:  no  matter  the 
number  of  tosses,  the  occurrence  of  heads  up  will  never  be  exactly  50  %,  which  is 
what  one  would  expect  based  on  a  knowledge  of  the  experiment  at  hand. 
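
The behavior of the relative frequency N(A)/N for a finite number of tosses can be explored with a short simulation. This is only a sketch: the "coin" is a pseudo-random number generator that is assumed to return heads with probability 0.5, so the fair-coin probability is built in by hand rather than measured, and the function name `relative_frequency` and the fixed seed are illustrative choices.

```python
import random


def relative_frequency(n_tosses, seed=0):
    """Relative frequency N(heads)/N of heads in n_tosses simulated tosses."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n_heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return n_heads / n_tosses


# The frequency fluctuates around 0.5 and settles closer to it as N grows.
for n in (10, 1000, 100000):
    print(n, relative_frequency(n))
```

For any finite N the printed value is only an estimate of the limit in (1.2).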

1.2.3 Bayesian or Empirical Method

Another  method  to  assign  probabilities  is  to  use  the  knowledge  of  the  experiment 
and  the  event,  and  the  probability  one  assigns  represents  the  degree  of  belief  that  the 
event  will  occur  in  a  given  try  of  the  experiment.  This  method  implies  an  element 
of  subjectivity,  which  will  become  more  evident  in  Bayes’  theorem  (see  Sect.  1.7). 
The  Bayesian  probability  is  assigned  based  on  a  quantitative  understanding  of 
the  nature  of  the  experiment,  and  in  accord  with  the  Kolmogorov  axioms.  It  is 
sometimes  referred  to  as  empirical  probability,  in  recognition  of  the  fact  that 
sometimes  the  probability  of  an  event  is  assigned  based  upon  a  practical  knowledge 
of  the  experiment,  although  without  the  classical  requirement  of  repeating  the 
experiment  for  a  large  number  of  times.  This  method  is  named  after  the  Rev.  Thomas 
Bayes,  who  pioneered  the  development  of  the  theory  of  probability  [3]. 

Example 1.1 (Coin Toss Experiment) In the coin toss experiment, the determination
of the empirical probability for events “heads up” or “tails up” relies on
the knowledge that the coin is unbiased, and that therefore it must be true that
P(tails) = P(heads). This empirical statement signifies the use of the Bayesian
method to determine probabilities. With this information, we can then simply use
the Kolmogorov axioms to state that P(tails) + P(heads) = 1, and therefore obtain
the intuitive result that P(tails) = P(heads) = 1/2. ◇


1.3  Fundamental  Properties  of  Probability 

The  following  properties  are  useful  to  improve  our  ability  to  assign  and  manipulate 
event  probabilities.  They  are  somewhat  intuitive,  but  it  is  instructive  to  derive  them 
formally  from  the  Kolmogorov  axioms. 

1. The probability of the null event is zero, P(∅) = 0.

Proof Start with the mutually exclusive events ∅ and Ω. Since their union is Ω,
it follows from the Third Axiom that P(Ω) = P(Ω) + P(∅). From the Second
Axiom we know that P(Ω) = 1; from this it follows that P(∅) = 0. □

The  following  property  is  a  generalization  of  the  one  described  above: 

2. The probability of the complementary event Ā satisfies the property

P(Ā) = 1 − P(A).  (1.3)

Proof By definition, it is true that A ∪ Ā = Ω, and that A, Ā are mutually
exclusive. Using the Second and Third axiom, P(A ∪ Ā) = P(A) + P(Ā) = 1,
from which it follows that P(Ā) = 1 − P(A). □

3. The probability of the union of two events satisfies the general property that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).  (1.4)

This property generalizes the Third Kolmogorov axiom, and can be interpreted as
the fact that outcomes in the overlap region of the two events should be counted
only once, as illustrated in Fig. 1.1.

Proof First, realize that the event A ∪ B can be written as the union of three
mutually exclusive sets, A ∪ B = (A ∩ B̄) ∪ (B ∩ Ā) ∪ (A ∩ B), see Fig. 1.1.
Therefore, using the Third axiom, P(A ∪ B) = P(A ∩ B̄) + P(B ∩ Ā) + P(A ∩ B).

Then, notice that for any event A and B, it is true that A = (A ∩ B) ∪ (A ∩ B̄),
since {B, B̄} is a partition of Ω. This implies that P(A) = P(A ∩ B) + P(A ∩ B̄) due
to the fact that the two sets are again mutually exclusive, and likewise for event B.
It thus follows that P(A ∪ B) = P(A) − P(A ∩ B) + P(B) − P(B ∩ A) + P(A ∩ B) =
P(A) + P(B) − P(A ∩ B). □

Example 1.2 An experiment consists of drawing a number between 1 and 100 at
random. Calculate the probability of the event: “drawing either a number greater
than 50, or an odd number, at each try.”

The sample space for this experiment is the set of numbers i = 1, ..., 100, and
the probability of drawing number i is P(A_i) = 1/100, since we expect that each
number will have the same probability of being drawn at each try. A_i is the event
that consists of drawing number i. If we call B the event consisting of all numbers
greater than 50, and C the event with all odd numbers, it is clear that P(B) = 0.5,
and likewise P(C) = 0.5. The event B ∩ C contains all odd numbers greater than 50,
and therefore P(B ∩ C) = 0.25. Using (1.4), we find that the probability of drawing
either a number greater than 50, or an odd number, is P(B ∪ C) = 0.5 + 0.5 − 0.25 = 0.75.
This can be confirmed by a direct count of the possible outcomes. ◇
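
The direct count mentioned in Example 1.2 can be sketched in a few lines of Python. The sketch assumes, as in the example, that all 100 numbers are equally likely; the helper `prob` is an illustrative name.

```python
from fractions import Fraction

outcomes = range(1, 101)                  # the numbers 1, ..., 100
B = {i for i in outcomes if i > 50}       # numbers greater than 50
C = {i for i in outcomes if i % 2 == 1}   # odd numbers


def prob(event):
    """Probability of an event when all 100 outcomes are equally likely."""
    return Fraction(len(event), 100)


p_union_direct = prob(B | C)                        # direct count of the union
p_union_formula = prob(B) + prob(C) - prob(B & C)   # property (1.4)

print(p_union_direct, p_union_formula)
```

Exact fractions are used instead of floating-point numbers so that the two results can be compared for equality.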


1.4  Statistical  Independence 

Statistical  independence  among  events  means  that  the  occurrence  of  one  event  has 
no  influence  on  the  occurrence  of  other  events.  Consider,  for  example,  rolling  two 
dice,  one  after  the  other:  the  outcome  of  one  die  is  independent  of  the  other,  and 
the  two  tosses  are  said  to  be  statistically  independent.  On  the  other  hand,  consider 
the  following  pair  of  events:  the  first  is  the  roll  of  die  1,  and  the  second  is  the  roll 
of  die  1  and  die  2,  so  that  for  the  second  event  we  are  interested  in  the  sum  of  the 
two  tosses.  It  is  clear  that  the  outcome  of  the  second  event — e.g.,  the  sum  of  both 
dice — depends  on  the  first  toss,  and  the  two  events  are  not  independent. 

Two  events  A  and  B  are  said  to  be  statistically  independent  if  and  only  if 


P(A ∩ B) = P(A) · P(B).  (1.5)

At this point, it is not obvious that the concept of statistical independence is
embodied  by  (1.5).  A  few  examples  will  illustrate  the  meaning  of  this  definition, 
which  will  be  explored  further  in  the  following  section  on  conditional  probability. 

Example 1.3 Determine the probability of obtaining two 3s when rolling two dice.
This event can be decomposed into two events: A = {die 1 shows 3 and die 2 shows
any number} and B = {die 2 shows 3 and die 1 shows any number}.

It is natural to assume that P(A) = 1/6, P(B) = 1/6 and state that the two events
A and B are independent by nature, since each event involves a different die, which
has no knowledge of the other one. The event we are interested in is C = A ∩ B and
the definition of probability of two statistically independent events leads to P(C) =
P(A ∩ B) = P(A) · P(B) = 1/36. This result can be confirmed by the fact that there
is only one combination out of 36 that gives rise to two consecutive 3s. ◇

The example above highlights the importance of a proper, and sometimes
extended, definition of an event. The more careful the description of the event and of
the experiment that it is drawn from, the easier it is to carry out probabilistic calculations
and to assess statistical independence.

Example 1.4 Consider the events A = {die 1 shows 3 and die 2 shows any number}
and B = {the sum of the two dice is 9}. Determine whether they are statistically
independent. 

In  this  case,  we  will  calculate  the  probability  of  the  two  events,  and  then  check 
whether  they  obey  (1.5)  or  not.  This  calculation  will  illustrate  that  the  two  events 
are  not  statistically  independent. 

Event A has a probability P(A) = 1/6; in order to calculate the probability
of event B, we realize that a sum of 9 is given by the following combinations of
outcomes of the two rolls: (3,6), (4,5), (5,4) and (6,3). Therefore, P(B) = 1/9.
The event A ∩ B is the situation in which both event A and B occur, which
corresponds to the single combination (3,6); therefore, P(A ∩ B) = 1/36. Since
P(A) · P(B) = 1/6 · 1/9 = 1/54 ≠ P(A ∩ B) = 1/36, we conclude that the
two events are not statistically independent. This conclusion means that one event
influences the other, since a 3 in the first toss has certainly an influence on the
possibility of both tosses having a total of 9. ◇
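
Both Example 1.3 and Example 1.4 can be checked by enumerating the 36 equally likely outcomes of two rolls and testing condition (1.5) directly. The following is a sketch under that equal-likelihood assumption; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (die 1, die 2), assumed equally likely.
rolls = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a true/false function of a roll."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))


def independent(a, b):
    """Check the definition (1.5): P(A and B) == P(A) * P(B)."""
    return prob(lambda r: a(r) and b(r)) == prob(a) * prob(b)


def die1_is_3(r):
    return r[0] == 3


def die2_is_3(r):
    return r[1] == 3


def sum_is_9(r):
    return r[0] + r[1] == 9


print(independent(die1_is_3, die2_is_3))  # events of Example 1.3
print(independent(die1_is_3, sum_is_9))   # events of Example 1.4
```

Describing each event as a condition on the full pair of rolls mirrors the "extended" definition of events used in the examples.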

There  are  two  important  necessary  (but  not  sufficient)  conditions  for  statistical 
independence  between  two  events.  These  properties  can  help  identify  whether  two 
events  are  independent. 

1. If A ∩ B = ∅, A and B cannot be independent, unless one is the empty set. This
property states that there must be some overlap between the two events, or else it
is not possible for the events to be independent.

Proof For A and B to be independent, it must be true that P(A ∩ B) = P(A) · P(B),
which is zero by hypothesis. This can be true only if P(A) = 0 or P(B) = 0,
which in turn means A = ∅ or B = ∅ as a consequence of the Kolmogorov
axioms. □

2. If A ⊂ B, then A and B cannot be independent, unless B is the entire sample
space. This property states that the overlap between two events cannot be such
that one event is included in the other, in order for statistical independence to be
possible.

Proof In order for A and B to be independent, it must be that P(A ∩ B) = P(A) ·
P(B) = P(A), given that A ⊂ B. This can only be true if B = Ω, since P(Ω) = 1. □

Example  1.5  Consider  the  above  Example  1.3  of  the  roll  of  two  dice;  each  event 
was  formulated  in  terms  of  the  outcome  of  both  rolls,  to  show  that  there  was  in  fact 
overlap between two events that are independent of one another. ◇

Example 1.6 Consider the following two events: A = {die 1 shows 3 and die 2
shows any number} and B = {die 1 shows 3 or 2 and die 2 shows any number}. It
is clear that A ⊂ B, P(A) = 1/6 and P(B) = 1/3. The event A ∩ B is identical to
A and P(A ∩ B) = 1/6. Therefore P(A ∩ B) ≠ P(A) · P(B) and the two events
are not statistically independent. This result can be easily explained by the fact
that the occurrence of A implies the occurrence of B, which is a strong statement
of dependence between the two events. The dependence between the two events
can also be expressed with the fact that the non-occurrence of B implies the
non-occurrence of A. ◇
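
The two necessary conditions can be illustrated with events represented as sets of outcomes of two rolls. This is a sketch assuming equally likely outcomes; event D (die 1 shows 4) is a hypothetical event introduced here only to obtain a pair of mutually exclusive events, and it does not appear in the examples above.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes (die 1, die 2), assumed equally likely.
omega = set(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(omega))


def is_independent(a, b):
    """Check the definition (1.5) on two sets of outcomes."""
    return prob(a & b) == prob(a) * prob(b)


A = {r for r in omega if r[0] == 3}        # die 1 shows 3
B = {r for r in omega if r[0] in (2, 3)}   # die 1 shows 3 or 2, so A is inside B
D = {r for r in omega if r[0] == 4}        # hypothetical event with no overlap with A

print(is_independent(A, D))      # no overlap: not independent
print(is_independent(A, B))      # A included in B: not independent (Example 1.6)
print(is_independent(A, omega))  # the exception: B is the entire sample space
```

The last line shows the exception stated in the second property: an event and the entire sample space always satisfy (1.5).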


1.5  Conditional  Probability 


The conditional probability describes the probability of occurrence of an event A
given that another event B has occurred, and it is indicated as P(A/B). The symbol
“/” indicates the statement given that or knowing that. It states that the event after the
symbol is known to have occurred. When two or more events are not independent,
the probability of a given event will in general depend on the occurrence of another
event. For example, if one is interested in obtaining a sum of 12 in two consecutive rolls of
a die, the probability of such an event depends on whether the first roll was (or
was not) a 6.

The  following  relationship  defines  the  conditional  probability: 

P(A ∩ B) = P(A/B) · P(B) = P(B/A) · P(A).  (1.6)

Equation (1.6) can be equivalently expressed as

P(A/B) = P(A ∩ B)/P(B)  if P(B) ≠ 0
P(A/B) = 0              if P(B) = 0.  (1.7)

A justification for this definition is that the occurrence of B means that the
probability of occurrence of A is that of A ∩ B. The denominator of the conditional
probability is P(B) because B is the set of all possible outcomes that are known to
have happened. The situation is also depicted in the right-hand side panel of Fig. 1.1:
knowing that B has occurred restricts the occurrence of A to the
occurrence of the intersection A ∩ B, out of all outcomes in B. It follows directly from
(1.6) that if A and B are statistically independent, then the conditional probability is
P(A/B) = P(A), i.e., the occurrence of B has no influence on the occurrence of A.
This observation further justifies the definition of statistical independence according
to (1.5).

Example 1.7 Calculate the probability of obtaining 8 as the sum of two rolls of a
die, given that the first roll was a 3.

Call event A the sum of 8 in two separate rolls of a die and event B the event
that the first roll is a 3. Event A comprises the tosses (2,6),
(3,5), (4,4), (5,3), (6,2). Since each such combination has a probability of 1/36,
P(A) = 5/36. The probability of event B is P(B) = 1/6. Also, the probability of
A ∩ B is the probability that the first roll is a 3 and the sum is 8, which can clearly
occur only if a sequence of (3,5) takes place, with probability P(A ∩ B) = 1/36.

According to the definition of conditional probability, P(A/B) = P(A ∩
B)/P(B) = (1/36)/(1/6) = 1/6, and in fact only combination (3,5), of the six available
with 3 as the outcome of the first toss, gives rise to a sum of 8. The occurrence
of 3 in the first roll has therefore increased the probability of A from P(A) = 5/36 to
P(A/B) = 1/6, since not every outcome of the first roll is equally conducive
to a sum of 8 in two rolls. ◇
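
The conditional probability of Example 1.7 can be reproduced by enumeration, following (1.7). This is a sketch assuming 36 equally likely outcomes; the function `conditional` and the event `first_is_1` are illustrative additions.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first roll, second roll), assumed equally likely.
rolls = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a true/false function of a roll."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))


def conditional(a, b):
    """P(A/B) according to (1.7), including the case P(B) = 0."""
    p_b = prob(b)
    if p_b == 0:
        return Fraction(0)
    return prob(lambda r: a(r) and b(r)) / p_b


def sum_is_8(r):
    return r[0] + r[1] == 8


def first_is_3(r):
    return r[0] == 3


def first_is_1(r):
    return r[0] == 1


print(prob(sum_is_8))                     # P(A)
print(conditional(sum_is_8, first_is_3))  # P(A/B) of Example 1.7
print(conditional(sum_is_8, first_is_1))  # a first roll of 1 rules out a sum of 8
```

The last line illustrates why the outcomes of the first roll are not all equally conducive to a sum of 8.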


1.6  A  Classic  Experiment:  Mendel’s  Law  of  Heredity 
and  the  Independent  Assortment  of  Species 

The  experiments  performed  in  the  nineteenth  century  by  Gregor  Mendel  in 
the  monastery  of  Brno  led  to  the  discovery  that  certain  properties  of  plants, 
such  as  seed  shape  and  color,  are  determined  by  a  pair  of  genes.  This  pair  of 
genes,  or  genotype ,  is  formed  by  the  inheritance  of  one  gene  from  each  of  the 
parent  plants. 

Mendel  began  by  crossing  two  pure  lines  of  pea  plants  which  differed  in 
one  single  characteristic.  The  first  generation  of  hybrids  displayed  only  one 
of  the  two  characteristics,  called  the  dominant  character.  For  example,  the 
first-generation  plants  all  had  round  seed,  although  they  were  bred  from  a 
population of pure round seed plants and one with wrinkled seed. When the
first generation was allowed to self-fertilize, Mendel observed the data
shown in Table 1.1 [31].



Table 1.1  Data from G. Mendel’s experiment

Character                      No. of dominant   No. of recessive   Fract. of dominant
Round vs. wrinkled seed             5474              1850               0.747
Yellow vs. green seed               6022              2001               0.751
Violet-red vs. white flower          705               224               0.759
Inflated vs. constricted pod         882               299               0.747
Green vs. yellow unripe pod          428               152               0.738
Axial vs. terminal flower            651               207               0.759
Long vs. short stem                  787               277               0.740

Table 1.2  Data from G. Mendel’s experiment for plants with two different characters

                 Yellow seed   Green seed
Round seed           315          108
Wrinkled seed        101           32

In addition, Mendel performed experiments in which two pure lines that differed by two characteristics were crossed. In particular, a line with yellow and round seeds was crossed with one that had green and wrinkled seeds. As in the previous case, the first-generation plants had a 100% occurrence of the dominant characteristics, while the second generation was distributed according to the data in Table 1.2.

One of the key results of these experiments goes under the name of Law of independent assortment, stating that a daughter plant inherits one gene from each parent plant independently of the other parent. If we denote the genotype of the dominant parent as DD (a pair of dominant genes) and that of the recessive parent as RR, then the data accumulated by Mendel support the hypothesis that the first-generation plants will have the genotype DR (the order of genes in the genotype is irrelevant) and the second-generation plants will have the following four genotypes: DD, DR, RD and RR, in equal proportions. Since the first three genotypes will display the dominant characteristic, the ratio of appearance of the dominant characteristic is expected to be 0.75. The data appear to fully support this hypothesis.
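
The fractions in the last column of Table 1.1 can be recomputed from the counts and compared with the expected value of 0.75. The following Python sketch is an added illustration; the dictionary `table_1_1` is simply a transcription of the table, and the pooled fraction over all seven characters is a quantity introduced here for convenience. The sketch does not assess whether the agreement is statistically significant.

```python
# Counts (dominant, recessive) transcribed from Table 1.1.
table_1_1 = {
    "Round vs. wrinkled seed": (5474, 1850),
    "Yellow vs. green seed": (6022, 2001),
    "Violet-red vs. white flower": (705, 224),
    "Inflated vs. constricted pod": (882, 299),
    "Green vs. yellow unripe pod": (428, 152),
    "Axial vs. terminal flower": (651, 207),
    "Long vs. short stem": (787, 277),
}

expected = 0.75  # P(DD) + P(DR) + P(RD) under the hypothesis of equal proportions

fractions = {}
for character, (dominant, recessive) in table_1_1.items():
    fractions[character] = dominant / (dominant + recessive)
    print(f"{character:30s} {fractions[character]:.3f}")

# Pooled fraction over all characters (an added summary, not in the table).
total_dominant = sum(d for d, _ in table_1_1.values())
total_plants = sum(d + r for d, r in table_1_1.values())
pooled = total_dominant / total_plants
print(f"pooled fraction: {pooled:.4f} (expected {expected})")
```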

In probabilistic terms, one expects that each second-generation plant has a probability P(D) = 0.5 of drawing a dominant gene from each parent and P(R) = 0.5 of drawing a recessive gene from each parent. Therefore, according to the hypothesis of independence in the inheritance of genes, we have

$$
\begin{aligned}
P(DD) &= P(D) \cdot P(D) = 0.25 \\
P(DR) &= P(D) \cdot P(R) = 0.25 \\
P(RD) &= P(R) \cdot P(D) = 0.25 \\
P(RR) &= P(R) \cdot P(R) = 0.25.
\end{aligned}
\tag{1.8}
$$

When plants differing by two characteristics are crossed, as in the case of the data in Table 1.2, each of the four events in (1.8) is independently mixed between the two characters. Therefore, there is a total of 16 possibilities, which give rise to 4 possible combinations of the two characters. For example, a display of both recessive characters requires the genotype RR for both characters and will have a probability of 0.25 · 0.25 = 1/16 = 0.0625. The data seemingly support this hypothesis with a measured fraction of 32/556 = 0.0576.
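
The 16 possibilities can be enumerated explicitly. The following Python sketch is an added illustration, assuming (as the text implies) that round and yellow are the dominant characters; the helper `shows_dominant` and the dictionary keys are hypothetical names chosen here. It counts how many of the 16 genotype pairs lead to each of the 4 visible combinations and compares the result with the fractions measured in Table 1.2.

```python
from fractions import Fraction
from itertools import product

# The four genotypes of (1.8), each with probability 1/4, for each character.
genotypes = ["DD", "DR", "RD", "RR"]

def shows_dominant(genotype):
    """A plant displays the dominant character if it carries at least one D gene."""
    return "D" in genotype

# Expected probability of each visible combination (shape dominant?, color dominant?).
expected = {}
for shape, color in product(genotypes, repeat=2):   # 16 equally likely pairs
    key = (shows_dominant(shape), shows_dominant(color))
    expected[key] = expected.get(key, 0) + Fraction(1, 16)

# Counts from Table 1.2, keyed by (round seed?, yellow seed?).
observed = {
    (True, True): 315,    # round, yellow
    (True, False): 108,   # round, green
    (False, True): 101,   # wrinkled, yellow
    (False, False): 32,   # wrinkled, green
}
n_total = sum(observed.values())

for key in expected:
    print(key, expected[key], round(observed[key] / n_total, 4))
```

Under these assumptions the expected probabilities are 9/16, 3/16, 3/16 and 1/16, to be compared with the measured fractions printed in the last column.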


1.7  The  Total  Probability  Theorem  and  Bayes’  Theorem 


In this section we describe two theorems that are of great importance in a number of practical situations. They make use of a partition of the sample space $\Omega$, consisting of n events $A_i$ that satisfy the following two properties:

$$
A_i \cap A_j = \emptyset, \quad \forall\, i \neq j; \qquad \bigcup_{i=1}^{n} A_i = \Omega. \tag{1.9}
$$

For example, the outcomes 1, 2, 3, 4, 5 and 6 for the roll of a die partition the sample space into a number of events that cover all possible outcomes, without any overlap among each other.

Theorem 1.1 (Total Probability Theorem) Given an event B and a set of events $A_i$ with the properties (1.9),

$$
P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B/A_i) \cdot P(A_i). \tag{1.10}
$$

Proof The first equation is immediately verified given that the $B \cap A_i$ are mutually exclusive events such that $B = \bigcup_i (B \cap A_i)$. The second equation derives from the application of the definition of conditional probability. □




The total probability theorem is useful when the probability of an event B cannot be easily calculated and it is easier to calculate the conditional probability $P(B/A_i)$ given a suitable set of conditions $A_i$. The example at the end of Sect. 1.7 illustrates one such situation.
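
As a simple check of (1.10), consider again two rolls of a die, with B the event that the sum is 8 and the partition $A_i$ given by the outcome i of the first roll. The following Python sketch is an added illustration; the function name `p_sum8_given_first` is hypothetical.

```python
from fractions import Fraction

# Partition of the sample space: A_i = "the first roll is i", i = 1, ..., 6.
p_partition = {i: Fraction(1, 6) for i in range(1, 7)}

def p_sum8_given_first(i):
    """P(B/A_i): the second roll must equal 8 - i, which is possible
    only if 8 - i is between 1 and 6."""
    return Fraction(1, 6) if 1 <= 8 - i <= 6 else Fraction(0)

# Total Probability theorem, Eq. (1.10).
p_b = sum(p_sum8_given_first(i) * p_partition[i] for i in p_partition)
print(p_b)   # 5/36
```

The result agrees with the value obtained by counting the five favorable combinations directly.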

Theorem 1.2 (Bayes’ Theorem) Given an event B and a set of events $A_i$ with properties (1.9),

$$
P(A_i/B) = \frac{P(B/A_i) P(A_i)}{P(B)} = \frac{P(B/A_i) P(A_i)}{\sum_{i=1}^{n} P(B \cap A_i)}. \tag{1.11}
$$

Proof The proof is an immediate consequence of the definition of conditional probability, (1.6), and of the Total Probability theorem, (1.10). □

Bayes’ theorem is often written in a simpler form by taking into account two events only, $A_i = A$ and B:

$$
P(A/B) = \frac{P(B/A) P(A)}{P(B)}. \tag{1.12}
$$


In  this  form,  Bayes’  theorem  is  just  a  statement  of  how  the  order  of  conditioning 
between  two  events  can  be  inverted. 
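
The inversion of the order of conditioning can be verified on a small case. The following Python sketch is an added illustration that reuses two rolls of a die, with A the event that the sum is 8 and B the event that the first roll is a 3; all variable names are chosen here. It obtains P(B/A) from P(A/B) through Bayes’ theorem and compares it with a direct count.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
sum_is_8 = [o for o in outcomes if sum(o) == 8]     # event A
first_is_3 = [o for o in outcomes if o[0] == 3]     # event B

p_a = Fraction(len(sum_is_8), len(outcomes))
p_b = Fraction(len(first_is_3), len(outcomes))
# P(A/B): fraction of the outcomes in B that also belong to A.
p_a_given_b = Fraction(sum(1 for o in first_is_3 if sum(o) == 8), len(first_is_3))

# Bayes' theorem inverts the conditioning: P(B/A) = P(A/B) P(B) / P(A).
p_b_given_a = p_a_given_b * p_b / p_a

# Direct count: fraction of the outcomes in A that also belong to B.
direct = Fraction(sum(1 for o in sum_is_8 if o[0] == 3), len(sum_is_8))

print(p_b_given_a, direct)   # 1/5 1/5
```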

Equation  (1.12)  plays  a  central  role  in  probability  and  statistics.  What  is 
especially  important  is  the  interpretation  that  each  term  assumes  within  the  context 
of  a  specific  experiment.  Consider  B  as  the  data  collected  in  a  given  experiment — 
these  data  can  be  considered  as  an  event,  containing  the  outcome  of  the  experiment. 
The  event  A  is  a  model  that  is  used  to  describe  the  data.  The  model  can  be  considered 
as  an  ideal  outcome  of  the  experiment,  therefore  both  A  and  B  are  events  associated 
with  the  same  experiment.  Following  this  interpretation,  the  quantities  involved  in 
Bayes’  theorem  can  be  interpreted  as  in  the  following: 

• P(B/A) is the probability, or likelihood, of the data given the specified model, and indicated as $\mathcal{L}$. The likelihood represents the probability of making the measurement B given that the model A is a correct description of the experiment.

• P(A) is the probability of the model A, without any knowledge of the data. This term is interpreted as a prior probability, or the degree of belief that the model is true before the measurements are made. Prior probabilities should be based upon quantitative knowledge of the experiment, but can also reflect the subjective belief of the analyst. This step in the interpretation of Bayes’ theorem explicitly introduces an element of subjectivity that is characteristic of Bayesian statistics.

•  P(B)  is  the  probability  of  collecting  the  dataset  B.  In  practice,  this  probability  acts 
as  a  normalization  constant  and  its  numerical  value  is  typically  of  no  practical 
consequence. 

• Finally, P(A/B) is the probability of the model after the data have been collected. This is referred to as the posterior probability of the model. The posterior probability is the ultimate goal of a statistical analysis, since it describes the probability of the model based on the collection of data. According to the value of the posterior probability, a model can be accepted or discarded.

This  interpretation  of  Bayes’  theorem  is  the  foundation  of  Bayesian  statistics. 
Models  of  an  experiment  are  usually  described  in  terms  of  a  number  of  parameters. 
One  of  the  most  common  problems  of  statistical  data  analysis  is  to  estimate  what 
values  for  the  parameters  are  permitted  by  the  data  collected  from  the  experiment. 
Bayes’  theorem  provides  a  way  to  update  the  prior  knowledge  on  the  model 
parameters  given  the  measurements,  leading  to  posterior  estimates  of  parameters. 
One key feature of Bayesian statistics is that the calculation of probabilities is
based  on  a  prior  probability,  which  may  rely  on  a  subjective  interpretation  of  what 
is  known  about  the  experiment  before  any  measurements  are  made.  Therefore,  great 
attention  must  be  paid  to  the  assignment  of  prior  probabilities  and  the  effect  of  priors 
on  the  final  results  of  the  analysis. 

Example 1.8 Consider a box in which there are red and blue balls, for a total of N = 10 balls. What is known a priori is just the total number of balls in the box. Of the first 3 balls drawn from the box, 2 are red and 1 is blue (drawing is done with replacement of balls after drawing). We want to use Bayes’ theorem to make inferences on the number of red balls (i) present in the box, i.e., we seek $P(A_i/B)$, the probability of having i red balls in the box, given that we performed the measurement B = {Two red balls were drawn in the first three trials}.

Initially, we may assume that $P(A_i) = 1/11$, meaning that there is an equal probability of having 0, 1, ... or 10 red balls in the box (for a total of 11 possibilities) before we make any measurements. Although this is a subjective statement, a uniform distribution is normally the logical assumption in the absence of other information. We can use basic combinatorial mathematics to determine the likelihood of drawing D = 2 red balls out of T = 3 trials, given that there are i red balls (also called event $A_i$):

$$
P(B/A_i) = \binom{T}{D} p^{D} q^{T-D}. \tag{1.13}
$$


In this equation p is the probability of drawing one of the red balls in a given drawing assuming that there are i red balls, $p = i/N$, and q is the probability of drawing one of the blue balls, $q = 1 - p = (N - i)/N$. The distribution in (1.13) is known as the binomial distribution and it will be derived and explained in more detail in Sect. 3.1. The likelihood $P(B/A_i)$ can therefore be rewritten as

$$
P(B/A_i) = \binom{3}{2} \left(\frac{i}{N}\right)^{2} \left(\frac{N-i}{N}\right). \tag{1.14}
$$


The probability P(B) is the probability of drawing D = 2 red balls out of T = 3 trials, for all possible values of the true number of red balls, i = 0, ..., 10. This probability can be calculated from the Total Probability theorem,

$$
P(B) = \sum_{i=0}^{N} P(B/A_i) \cdot P(A_i). \tag{1.15}
$$

We can now put all the pieces together and determine the posterior probability of having i red balls, $P(A_i/B)$, using Bayes’ theorem, $P(A_i/B) = P(B/A_i) P(A_i) / P(B)$.

The equation above is clearly a function of i, the true number of red balls. Consider the case of i = 0, i.e., what is the posterior probability of having no red balls in the box. Since

$$
P(B/A_0) = \binom{3}{2} \left(\frac{0}{N}\right)^{2} \left(\frac{N-0}{N}\right) = 0,
$$

it follows that $P(A_0/B) = 0$, i.e., it is impossible that there are no red balls. This is obvious, since a red ball was in fact drawn twice, meaning that there is at least one red ball in the box. Other posterior probabilities can be calculated in a similar way. ◇
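
The remaining posterior probabilities of Example 1.8 can be tabulated with a few lines of code. The following Python sketch is an added illustration under the assumptions stated in the example (uniform prior, N = 10, D = 2 red balls in T = 3 draws with replacement); the names `likelihood`, `evidence` and `posterior` are chosen here.

```python
from fractions import Fraction
from math import comb

N = 10   # total number of balls in the box
T = 3    # number of draws (with replacement)
D = 2    # number of red balls drawn

# Uniform prior P(A_i) = 1/11 for i = 0, ..., 10 red balls.
prior = [Fraction(1, N + 1)] * (N + 1)

def likelihood(i):
    """Binomial likelihood P(B/A_i) of Eq. (1.13) with p = i/N."""
    p = Fraction(i, N)
    q = 1 - p
    return comb(T, D) * p**D * q**(T - D)

# P(B) from the Total Probability theorem, Eq. (1.15).
evidence = sum(likelihood(i) * prior[i] for i in range(N + 1))

# Posterior P(A_i/B) from Bayes' theorem.
posterior = [likelihood(i) * prior[i] / evidence for i in range(N + 1)]

for i, post in enumerate(posterior):
    print(f"i = {i:2d}   P(A_i/B) = {float(post):.4f}")
```

With these assumptions the posterior vanishes for i = 0 and i = 10, since at least one red and one blue ball must be present, and it is largest for i = 7.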


Summary  of  Key  Concepts  for  this  Chapter 

□  Event: A set of possible outcomes of an experiment.

□  Sample space: All possible outcomes of an experiment.

□  Probability of an Event: A number between 0 and 1 that follows the Kolmogorov axioms.

□  Frequentist or Classical approach: A method to determine the probability of an event based on many repetitions of the experiment.

□  Bayesian or Empirical approach: A method to determine probabilities that uses prior knowledge of the experiment.

□  Statistical independence: Two events are statistically independent when the occurrence of one has no influence on the occurrence of the other, $P(A \cap B) = P(A) P(B)$.

□  Conditional probability: Probability of occurrence of an event given that another event is known to have occurred, $P(A/B) = P(A \cap B)/P(B)$.

□  Total Probability theorem: A relationship among probabilities of events that form a partition of the sample space, $P(B) = \sum_i P(B/A_i) P(A_i)$.

□  Bayes’ theorem: A relationship among conditional probabilities that enables the change in the order of conditioning of the events, $P(A/B) = P(B/A) P(A)/P(B)$.

Problems

1.1  Describe  the  sample  space  of  the  experiment  consisting  of  flipping  four  coins 
simultaneously.  Assign  the  probability  to  the  event  consisting  of  “two  heads  up  and 
two  tails  up.”  In  this  experiment  it  is  irrelevant  to  know  which  specific  coin  shows 
heads  up  or  tails  up. 

1.2  An  experiment  consists  of  rolling  two  dice  simultaneously  and  independently 
of  one  another.  Find  the  probability  of  the  event  consisting  of  having  either  an  odd 
number  in  the  first  roll  or  a  total  of  9  in  both  rolls. 

1.3  In  the  roll  of  a  die,  find  the  probability  of  the  event  consisting  of  having  either 
an  even  number  or  a  number  greater  than  4. 

1.4  An  experiment  consists  of  rolling  two  dice  simultaneously  and  independently 
of  one  another.  Show  that  the  two  events,  “the  sum  of  the  two  rolls  is  8”  and  “the 
first  roll  shows  5”  are  not  statistically  independent. 

1.5  An  experiment  consists  of  rolling  two  dice  simultaneously  and  independently 
of  one  another.  Show  that  the  two  events,  “first  roll  is  even”  and  “second  roll  is 
even”  are  statistically  independent. 

1.6  A  box  contains  5  balls,  of  which  3  are  red  and  2  are  blue.  Calculate  (a)  the 
probability  of  drawing  two  consecutive  red  balls  and  (b)  the  probability  of  drawing 
two  consecutive  red  balls,  given  that  the  first  draw  is  known  to  be  a  red  ball.  Assume 
that  after  each  draw  the  ball  is  replaced  in  the  box. 

1.7  A  box  contains  10  balls  that  can  be  either  red  or  blue.  Of  the  first  three  draws, 
done  with  replacement,  two  result  in  the  draw  of  a  red  ball.  Calculate  the  ratio  of  the 
probability  that  there  are  2  or  just  1  red  ball  in  the  box  and  the  ratio  of  probability 
that  there  are  5  or  1  red  balls. 

1.8  In  the  game  of  baseball  a  player  at  bat  either  reaches  base  or  is  retired.  Consider 
three  baseball  players:  player  A  was  at  bat  200  times  and  reached  base  0.310  of 
times;  player  B  was  at  bat  250  times,  with  an  on-base  percentage  of  0.296;  player 
C was at bat 300 times, with an on-base percentage of 0.260. Find (a) the probability
that  when  either  player  A,  B,  or  C  were  at  bat,  he  reached  base,  (b)  the  probability 
that,  given  that  a  player  reached  base,  it  was  A,  B,  or  C. 

1.9  An  experiment  consists  of  rolling  two  dice  simultaneously  and  independently 
of one another. Calculate (a) the probability of the first roll being a 1, given that the
sum  of  both  rolls  was  5,  (b)  the  probability  of  the  sum  being  5,  given  that  the  first 
roll  was  a  1  and  (c)  the  probability  of  the  first  roll  being  a  1  and  the  sum  being  5. 
Finally,  (d)  verify  your  results  with  Bayes’  theorem. 


1.10 Four coins labeled 1 through 4 are tossed simultaneously and independently of one another. Calculate (a) the probability of having an ordered combination heads-tails-heads-tails in the four coins, (b) the probability of having the same ordered combination given that any two coins are known to have landed heads-up and (c) the probability of having two coins land heads up given that the sequence heads-tails-heads-tails has occurred.


Chapter  2 

Random  Variables  and  Their  Distributions 


Abstract  The  purpose  of  performing  experiments  and  collecting  data  is  to  gain 
information  on  certain  quantities  of  interest  called  random  variables.  The  exact 
value  of  these  quantities  cannot  be  known  with  absolute  precision,  but  rather  we  can 
constrain  the  variable  to  a  given  range  of  values,  narrower  or  wider  according  to  the 
nature  of  the  variable  itself  and  the  type  of  experiment  performed.  Random  variables 
are  described  by  a  distribution  function,  which  is  the  theoretical  expectation  for  the 
outcome  of  experiments  aimed  to  measure  it.  Other  measures  of  the  random  variable 
are  the  mean,  variance,  and  higher-order  moments. 


2.1  Random  Variables 

A  random  variable  is  a  quantity  of  interest  whose  true  value  is  unknown.  To  gain 
information  on  a  random  variable  we  design  and  conduct  experiments.  It  is  inherent 
to  any  experiment  that  the  random  variable  of  interest  will  never  be  known  exactly. 
Instead,  the  variable  will  be  characterized  by  a  probability  distribution  function , 
which  determines  what  is  the  probability  that  a  given  value  of  the  random  variable 
occurs.  Repeating  the  measurement  typically  increases  the  knowledge  we  gain  of  the 
distribution  of  the  variable.  This  is  the  reason  for  wanting  to  measure  the  quantity 
as  many  times  as  possible. 

As an example of a random variable, consider the gravitational constant G. Despite the label of “constant”, we only know it to have a range of possible values in the approximate interval $G = (6.67428 \pm 0.00067) \times 10^{-11}$ (in the standard S.I. units, m$^3$ kg$^{-1}$ s$^{-2}$). This means that we don’t know the true value of G, but we estimate the range of possible values by means of experiments. The random nature of virtually all quantities lies primarily in the fact that no quantity is known exactly to us without performing an experiment and that no experiment is ever perfect because of practical or even theoretical limitations. Among the practical reasons are, for example, limitations in the precision of the measuring apparatus. Theoretical reasons depend on the nature of the variable. For example, the measurement of the position and velocity of a subatomic particle is limited by the Heisenberg uncertainty principle, which forbids an exact knowledge even in the presence of a perfect measuring apparatus.

The general method for gaining information on a random variable X starts with a set of measurements $x_i$, ensuring that measurements are performed under the same experimental conditions. Throughout the book we will reserve uppercase letters for the name of the variable itself and lowercase letters for the actual measurements. From these measurements, one obtains a histogram corresponding to the frequency of occurrence of all values of X (Fig. 2.1). The measurements $x_i$ form the sample distribution of the quantity, which describes the empirical distribution of values collected in the experiment. On the other hand, random variables are typically expected to have a theoretical distribution, e.g., Gaussian, Poisson, etc., known as the parent distribution. The parent distribution represents the belief that there is an ideal description of a random variable and its form depends on the nature of the variable itself and the method of measurement. The sample distribution is expected to become the parent distribution if an infinite number of measurements are performed, in such a way that the randomness associated with a small number of measurements is eliminated.

Fig. 2.1 Example of data collected to measure a random variable X. The 500 measurements were binned according to their value to construct the sample distribution. The shape of the distribution depends on the nature of the experiment and on the number of measurements

Example 2.1 In Sect. 3.3 we will show that a discrete variable (e.g., one that can only take integer values) that describes a counting experiment follows a Poisson function,

$$
P(n) = \frac{\mu^{n}}{n!} e^{-\mu},
$$

in which $\mu$ is the mean value of the random variable (for short, its true-yet-unknown value) and n is the actual value measured for the variable. P(n) indicates the probability of measuring the value n, given that the true value is $\mu$. Consider the experiment of counting the number of photons reaching Earth from a given star. Due to a number of factors, the count may not always be the same every time the experiment is performed. If only one experiment is performed, one would obtain a sample distribution that has a single “bar” at the location of the measured value, and this sample distribution would not match a Poisson function well. After a small number of measurements, the distribution may appear similar to that in Fig. 2.1, and the distribution will then become smoother and closer to the parent distribution as the number of measurements increases. Repeating the experiment therefore will help in the effort to estimate as precisely as possible the parameter $\mu$ that determines the Poisson distribution. ◇
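
The approach of the sample distribution to the parent distribution can be illustrated with simulated counts. The following Python sketch is an added illustration, not part of the original example. It assumes a parent mean of 3 counts, draws Poisson variables with Knuth's multiplication method (a technique chosen here for self-containment), and measures the mismatch between sample and parent distributions with the total variation distance, which is also a choice made here. The function names are hypothetical.

```python
import math
import random
from collections import Counter

def poisson_pmf(n, mu):
    """Parent distribution P(n) = mu^n e^(-mu) / n!."""
    return mu**n * math.exp(-mu) / math.factorial(n)

def draw_poisson(mu, rng):
    """Draw one Poisson count with Knuth's multiplication method."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def total_variation(samples, mu):
    """Half the summed absolute difference between sample and parent distributions."""
    counts = Counter(samples)
    n_max = max(counts)
    diff = sum(abs(counts[n] / len(samples) - poisson_pmf(n, mu))
               for n in range(n_max + 1))
    # Values above n_max never occur in the sample; add their parent probability.
    tail = 1.0 - sum(poisson_pmf(n, mu) for n in range(n_max + 1))
    return 0.5 * (diff + tail)

rng = random.Random(0)   # fixed seed for reproducibility
mu = 3.0                 # assumed parent mean
distances = {}
for n_meas in (1, 20, 500, 50000):
    samples = [draw_poisson(mu, rng) for _ in range(n_meas)]
    distances[n_meas] = total_variation(samples, mu)
    print(f"N = {n_meas:6d}   distance from parent = {distances[n_meas]:.3f}")
```

A single measurement gives a distance close to 1, corresponding to the single “bar” described above, while the distance decreases toward zero as the number of measurements grows.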


2.2  Probability  Distribution  Functions 

It is convenient to describe random variables with an analytic function that determines the probability of the random variable to have a given value. Discrete random variables are described by a probability mass function $f(x_i)$, where $f(x_i)$ represents the probability of the variable to have an exact value of $x_i$. Continuous variables are described by a probability distribution function f(x), such that f(x)dx is the probability of the variable to have values in the interval [x, x+dx]. For simplicity we will refer to both types of distributions as probability distribution functions throughout the book.

Probability distribution functions have the following properties:

1. They are normalized to 1. For continuous variables this means

$$
\int_{-\infty}^{+\infty} f(x)\,dx = 1. \tag{2.1}
$$

For variables that are defined in a subset of the real numbers, e.g., only values $x \geq 0$ or in a finite interval, f(x) is set to zero outside the domain of definition of the function. For discrete variables, hereafter the integrals are replaced by a sum over all values that the variable of integration can have.

2. The probability distribution can never be negative, $f(x) \geq 0$. This is a consequence of the Kolmogorov axiom that requires a probability to be non-negative.

3. The function F(x), called the (cumulative) distribution function,

$$
F(x) = \int_{-\infty}^{x} f(z)\,dz, \tag{2.2}
$$

represents the probability that the variable has any value less than or equal to x. F(x) is a non-decreasing function of x that starts at zero and has its highest value of one.

Example 2.2 The exponential random variable follows the probability distribution function defined by

$$
f(x) = \lambda e^{-\lambda x}, \quad x \geq 0, \tag{2.3}
$$

where $\lambda$ is an adjustable parameter that must be positive. The probability distribution function is therefore f(x) = 0 for negative values of the variable. The cumulative distribution function is given by

$$
F(x) = 1 - e^{-\lambda x}. \tag{2.4}
$$

Figure 2.2 shows the probability distribution function f(x) and the cumulative distribution function F(x) for an exponential variable with $\lambda = 0.5$. ◇

Fig. 2.2 The distribution function f(x) (solid line) and the cumulative distribution function F(x) (dashed line) for an exponential variable with $\lambda = 0.5$. The horizontal axis is the random variable X
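
The normalization of (2.3) and the cumulative distribution (2.4) can be checked numerically. The following Python sketch is an added illustration for $\lambda = 0.5$; the midpoint-rule integrator `integrate` is a hypothetical helper written here, and the upper limit of 60 is an assumption that stands in for infinity, since the exponential tail beyond it is negligible.

```python
import math

lam = 0.5   # value of the parameter used in Fig. 2.2

def pdf(x):
    """Exponential probability distribution function, Eq. (2.3)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(x):
    """Analytic cumulative distribution function, Eq. (2.4)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func between a and b."""
    h = (b - a) / steps
    return h * sum(func(a + (k + 0.5) * h) for k in range(steps))

# Normalization, Eq. (2.1); f(x) = 0 for x < 0, so the integral starts at 0.
norm = integrate(pdf, 0.0, 60.0)

# Cumulative distribution at x = 2 from Eq. (2.2), compared with Eq. (2.4).
cdf_numeric = integrate(pdf, 0.0, 2.0)

print(f"normalization = {norm:.6f}")
print(f"F(2) numeric = {cdf_numeric:.6f}, analytic = {cdf(2.0):.6f}")
```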


2.3  Moments  of  a  Distribution  Function 


The probability distribution function f(x) provides a complete description of the random variable. It is convenient to find a few quantities that describe the salient features of the distribution. The moment of order n, $\mu_n$, is defined as

$$
\mu_n = E[X^n] = \int f(x)\, x^{n}\, dx. \tag{2.5}
$$

The moment $\mu_n$ is also represented as $E[X^n]$, the expectation of the function $X^n$. It is possible to demonstrate, although mathematically beyond the scope of this book, that the knowledge of moments of all orders is sufficient to determine uniquely the distribution function [42]. This is an important fact, since it shifts the problem of determining the distribution function to that of determining at least some of its moments. Moreover, a number of distribution functions only have a few non-zero moments, and this renders the task even more manageable.


The moments or expectations of a distribution are theoretical quantities that can be calculated from the probability distribution f(x). They are parent quantities that we wish to estimate via measurements. In the following we describe the two main expectations, the mean and the variance, and the sample quantities that approximate them, the sample mean and the sample variance. Chapter 5 describes a method to justify the estimates of parent quantities via sample quantities.


2.3.1  The  Mean  and  the  Sample  Mean 

The moment of the first order is also known as the mean or expectation of the random variable,

$$
\mu = E[X] = \int_{-\infty}^{+\infty} x f(x)\,dx. \tag{2.6}
$$


The  expectation  is  a  linear  operation  and  therefore  satisfies  the  property  that,  e.g., 


E[aX  +  bY]  =  aE[X]  +  bE[Y], 


(2.7) 


where  a  and  b  are  constants.  This  is  a  convenient  property  to  keep  in  mind  when 
evaluating  expectations  of  complex  functions  of  a  random  variable  X. 

To estimate the mean of a random variable, consider N measurements $x_i$ and define the sample mean as

$$
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{2.8}
$$


To illustrate that the sample mean $\bar{x}$ defined by (2.8) is equivalent to the mean $\mu$, consider a discrete variable, for which

$$
E[X] = \sum_{j=1}^{M} f(x_j)\, x_j, \tag{2.9}
$$

where $f(x_j)$ is the probability distribution function and we have assumed that the variable can only have M possible values. According to the classical interpretation of the probability, the distribution function is given by

$$
f(x_j) = \lim_{N \to \infty} \frac{N(x_j)}{N},
$$

in which $N(x_j)$ is the number of occurrences of the value $x_j$. Since $\sum_{j=1}^{M} N(x_j)\, x_j$ is the sum of the values obtained in N measurements, it is equivalent to $\sum_{i=1}^{N} x_i$. Therefore the sample mean will be identical to the parent mean in the limit of an infinite number of measurements,

$$
\lim_{N \to \infty} \bar{x} = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} x_i = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{M} N(x_j)\, x_j = \sum_{j=1}^{M} f(x_j)\, x_j = E[X].
$$

A proof that the sample mean provides an unbiased estimate of the mean will be given in Chap. 5 for Gaussian and Poisson variables.

The  sample  mean  is  therefore  a  representative  value  of  the  random  variable  that 
estimates  the  parent  mean  using  a  finite  number  of  measurements.  Other  measures  of 
a  random  variable  include  the  mode ,  defined  as  the  value  of  maximum  probability, 
and  the  median ,  defined  as  the  value  that  separates  the  lower  50  %  and  the  upper 
50  %  of  the  distribution  function.  For  distributions  that  are  symmetric  with  respect  to 
the  peak  value,  as  is  the  case  for  the  Gaussian  distribution  defined  below  in  Sect.  3.2, 
the  peak  value  coincides  with  the  mean,  median,  and  mode.  A  more  detailed  analysis 
of  the  various  measures  of  the  “average”  value  of  a  variable  is  described  in  Chap.  6. 
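
The convergence of the sample mean to the parent mean can be illustrated with simulated rolls of a fair die, for which $f(x_j) = 1/6$. The following Python sketch is an added illustration; the choice of the die, the seed and the numbers of measurements are assumptions made here. It also checks that summing the measurements one by one gives the same result as summing each possible value times its number of occurrences.

```python
import random
from collections import Counter
from fractions import Fraction

values = range(1, 7)   # the M = 6 possible values x_j of a die

# Parent mean E[X] = sum_j f(x_j) x_j with f(x_j) = 1/6.
parent_mean = sum(Fraction(x, 6) for x in values)

rng = random.Random(1)   # fixed seed for reproducibility
sample_means = {}
for n_meas in (10, 1000, 200_000):
    rolls = [rng.randint(1, 6) for _ in range(n_meas)]
    counts = Counter(rolls)   # N(x_j): number of occurrences of each value
    # Summing over measurements equals summing over values weighted by N(x_j).
    assert sum(rolls) == sum(counts[x] * x for x in values)
    sample_means[n_meas] = sum(rolls) / n_meas
    print(f"N = {n_meas:7d}   sample mean = {sample_means[n_meas]:.4f}")

print(f"parent mean = {float(parent_mean)}")
```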


2.3.2  The  Variance  and  the  Sample  Variance 

The variance is the expectation of the square of the deviation of X from its mean:

$$
\mathrm{Var}(X) = E[(X - \mu)^2] = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx = \sigma^2. \tag{2.10}
$$

The square root of the variance is referred to as the standard deviation or standard error $\sigma$ and it is a common measure of the average difference of a given measurement $x_i$ from the mean of the random variable. Notice that from the point of view of physical dimensions of the moments defined by (2.5), moments of the n-th order have the dimensions of the random variable to the n-th power. For example, if X is measured in meters, the variance is measured in square meters (m$^2$), thus the need to use the square root of the variance as a measure of the standard deviation of the variable from its mean.

The main reason for defining the average difference of a measurement from its mean in terms of a moment of the second order is that the expectation of the deviation $X - \mu$ is always zero, as can be immediately seen using the linearity property of the expectation. The deviation of a random variable is therefore not of common use in statistics, since its expectation is null.

The sample variance is defined as

$$
s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2, \tag{2.11}
$$

and a proof that this quantity is an unbiased estimate of the parent variance will be provided in Chap. 5. The presence of a factor of N − 1, and not just N, in the denominator of the sample variance, is caused by the fact that the sample variance requires also an estimate of the sample mean, since the exact value of the parent mean is unknown. This result will be explained further in Sect. 5.1.2.

Using the linear property of the expectation, it is straightforward to show that the following property applies, since $E[(X - \mu)^2] = E[X^2] - 2\mu E[X] + \mu^2$ and $E[X] = \mu$:

$$
\mathrm{Var}(X) = E[X^2] - \mu^2. \tag{2.12}
$$


This  relationship  is  very  convenient  to  calculate  the  variance  from  the  moments  of 
the  first  and  second  order.  The  deviation  and  the  variance  are  moments  calculated 
with  respect  to  the  mean,  also  referred  to  as  central  moments. 

Another  useful  property  of  the  variance,  which  follows  from  the  fact  that  the 
variance  is  a  moment  of  the  second  order,  is 


$$ \mathrm{Var}(aX) = a^2\, \mathrm{Var}(X) \tag{2.13} $$


where  a  is  a  constant. 
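
As a quick numerical check of (2.12) and (2.13), the following Python sketch evaluates the moments of a small discrete distribution exactly. The fair six-sided die, the constant a = 3, and the helper name `expectation` are assumptions chosen here for illustration; they do not appear in the text.

```python
from fractions import Fraction

# Illustrative assumption: a fair six-sided die, P(x) = 1/6 for x = 1, ..., 6.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6


def expectation(func, values, probs):
    """Expectation E[func(X)] of a discrete random variable."""
    return sum(func(x) * p for x, p in zip(values, probs))


mu = expectation(lambda x: x, values, probs)                    # first moment
second_moment = expectation(lambda x: x**2, values, probs)      # E[X^2]
variance = expectation(lambda x: (x - mu) ** 2, values, probs)  # second central moment

# The deviation X - mu has zero expectation, which is why it is not used
# as a measure of spread.
mean_deviation = expectation(lambda x: x - mu, values, probs)

# Scaling the variable by a constant a scales the variance by a^2.
a = 3
scaled_values = [a * x for x in values]
mu_scaled = expectation(lambda x: x, scaled_values, probs)
variance_scaled = expectation(lambda x: (x - mu_scaled) ** 2, scaled_values, probs)

print("mean:", mu)
print("variance:", variance, "E[X^2] - mu^2:", second_moment - mu**2)
print("Var(aX):", variance_scaled, "a^2 Var(X):", a**2 * variance)
```

Exact fractions are used so that the two sides of each identity can be compared without rounding error.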

2.4  A  Classic  Experiment:  J.J.  Thomson’s  Discovery 
of  the  Electron 

A set of experiments by J.J. Thomson in the late nineteenth century was
aimed at the measurement of the ratio between the mass and charge of a
new lightweight particle, which was later named the electron. The experiment
was truly groundbreaking not just for the method used, but also because it
revolutionized our understanding of physics and natural sciences by proving
that the new particle was considerably lighter than the previously known
charge carrier, the proton.

The  experiment  described  in  this  book  was  reported  by  Thomson  in  [39]. 
It  consists  of  measuring  the  deflection  of  negatively  charged  cathode  rays  by 
a  magnetic  field  H  in  a  tube.  Thomson  wanted  to  measure  the  mass  m  of  the 
charged  particles  that  constituted  these  cathode  rays.  The  experiment  is  based 
on the measurement of the following quantities: $W$ is the kinetic energy of the
particles, $Q = Ne$ is the amount of electricity carried by the particles ($N$ is the
number of particles and $e$ the charge of each particle) and $I = HR$, where $R$
is the radius of curvature of the path of these rays in a magnetic field $H$. The
measurements performed by Thomson were used to infer the ratio $m/e$ and
the speed $v$ of the new lightweight particle according to

$$ v = \frac{2W}{QI}, \qquad \frac{m}{e} = \frac{I^2 Q}{2W}. \tag{2.14} $$


For the purpose of the data analysis of this experiment, it is only necessary
to know that $W/Q$ and $I$ are the primary quantities being measured, and
inferences on the secondary quantities of interest are based on (2.14). For the
proton, the mass-to-charge ratio was known to be approximately $1 \times 10^{-4}$ g per
electromagnetic (EMU) charge unit, where the EMU charge unit is equivalent
to $10^{10}$ electrostatic charge units, or ESU (a more common unit of measure
for charge). In Thomson’s units, the accepted value of the mass-to-charge ratio
of the electron is now $5.7 \times 10^{-8}$. Some of the experimental data collected by
Thomson are reported in Tables 2.1 and 2.2, in which “gas” refers to the gas
used in the tubes he used for the experiment.

Some of Thomson’s conclusions are reported here:

(a) “It will be seen from these tables that the value of m/e is independent of
the nature of the gas”;

(b) “the values of m/e were, however, the same in the two tubes.”;

(c) “for the first tube, the mean for air is $0.40 \times 10^{-7}$, for hydrogen $0.42 \times 10^{-7}$
and for carbonic acid $0.4 \times 10^{-7}$”;

(d) “for the second tube, the mean for air is $0.52 \times 10^{-7}$, for hydrogen $0.50 \times 10^{-7}$
and for carbonic acid $0.54 \times 10^{-7}$”.

Using the equations for sample mean and variance explained in Sect. 2.3,
we are already in a position to measure the sample means and variances in air
as $\overline{x}_1 = 0.42$ and $s_1^2 = 0.005$ for Tube 1, and $\overline{x}_2 = 0.52$ and $s_2^2 = 0.003$ for
Tube 2. These statistics can be reported as a measurement of $0.42 \pm 0.07$ for
Tube 1 and $0.52 \pm 0.06$ for Tube 2. To make more quantitative statements on
the statistical agreement between the two measurements, we need to know
the probability distribution function of the sample mean. The test
to determine whether the two measurements are consistent with each other
will be explained in Sect. 7.5. For now, we simply point out that the overlap
between the ranges of the two measurements is an indication of their statistical
agreement.


Note: The three measurements marked with a star appear to have values of
$v$ or $m/e$ that are inconsistent with the formulas used to calculate them from $W/Q$
and $I$. They may be typographical errors in the original publication. The first
appears to be a typo in $W/Q$ ($6 \times 10^{12}$ should be $6 \times 10^{11}$); the corrected value
is assumed throughout this book. The second has an inconsistent value for $v$
(it should be $6.5 \times 10^{9}$, not $7.5 \times 10^{9}$), and the third has inconsistent values for both $v$
and $m/e$, but no correction was applied in these cases to the data in the tables.
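
The consistency check described in the note can be sketched in a few lines of Python. The sketch assumes the relations $v = 2W/(QI)$ and $m/e = I^2 Q/(2W)$, which follow from $W = \frac{1}{2} N m v^2$, $Q = Ne$ and $I = HR = mv/e$ and which reproduce the unstarred rows of the tables; the function names are hypothetical.

```python
def velocity(w_over_q, i):
    """Speed v = 2 (W/Q) / I (assumed relation, see lead-in)."""
    return 2.0 * w_over_q / i


def mass_to_charge(w_over_q, i):
    """Ratio m/e = I^2 / (2 W/Q) (assumed relation, see lead-in)."""
    return i**2 / (2.0 * w_over_q)


# (label, W/Q, I, tabulated m/e, tabulated v) for the three starred rows,
# copied from the tables as printed.
starred = [
    ("Tube 1, hydrogen", 6e12, 205, 0.35e-7, 6e9),
    ("Tube 1, carbonic acid", 8.4e11, 260, 0.4e-7, 7.5e9),
    ("Tube 2, air", 2.8e11, 175, 0.47e-7, 4.1e9),
]

for label, wq, i, me_tab, v_tab in starred:
    print(
        f"{label}: m/e = {mass_to_charge(wq, i):.2e} (table {me_tab:.2e}), "
        f"v = {velocity(wq, i):.2e} (table {v_tab:.2e})"
    )

# With the corrected W/Q = 6e11 for the starred hydrogen row, both derived
# quantities agree with the tabulated values to the printed precision.
print(f"corrected hydrogen: m/e = {mass_to_charge(6e11, 205):.2e}, "
      f"v = {velocity(6e11, 205):.2e}")
```

Comparing the recomputed and tabulated columns row by row makes the three discrepancies easy to spot.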

Table 2.1 Data from Thomson’s measurements of Tube 1

Gas               W/Q             I      m/e             v
Tube 1
Air               4.6 × 10^11     230    0.57 × 10^-7    4 × 10^9
Air               1.8 × 10^12     350    0.34 × 10^-7    1 × 10^10
Air               6.1 × 10^11     230    0.43 × 10^-7    5.4 × 10^9
Air               2.5 × 10^12     400    0.32 × 10^-7    1.2 × 10^10
Air               5.5 × 10^11     230    0.48 × 10^-7    4.8 × 10^9
Air               1 × 10^12       285    0.4 × 10^-7     7 × 10^9
Air               1 × 10^12       285    0.4 × 10^-7     7 × 10^9
Hydrogen ★        6 × 10^12       205    0.35 × 10^-7    6 × 10^9
Hydrogen          2.1 × 10^12     460    0.5 × 10^-7     9.2 × 10^9
Carbonic acid ★   8.4 × 10^11     260    0.4 × 10^-7     7.5 × 10^9
Carbonic acid     1.47 × 10^12    340    0.4 × 10^-7     8.5 × 10^9
Carbonic acid     3.0 × 10^12     480    0.39 × 10^-7    1.3 × 10^10

See Note for meaning of ★

Table 2.2 Data from Thomson’s measurements of Tube 2

Gas               W/Q             I      m/e             v
Tube 2
Air               2.8 × 10^11     175    0.53 × 10^-7    3.3 × 10^9
Air ★             2.8 × 10^11     175    0.47 × 10^-7    4.1 × 10^9
Air               3.5 × 10^11     181    0.47 × 10^-7    3.8 × 10^9
Hydrogen          2.8 × 10^11     175    0.53 × 10^-7    3.3 × 10^9
Air               2.5 × 10^11     160    0.51 × 10^-7    3.1 × 10^9
Carbonic acid     2.0 × 10^11     148    0.54 × 10^-7    2.5 × 10^9
Air               1.8 × 10^11     151    0.63 × 10^-7    2.3 × 10^9
Hydrogen          2.8 × 10^11     175    0.53 × 10^-7    3.3 × 10^9
Hydrogen          4.4 × 10^11     201    0.46 × 10^-7    4.4 × 10^9
Air               2.5 × 10^11     176    0.61 × 10^-7    2.8 × 10^9
Air               4.2 × 10^11     200    0.48 × 10^-7    4.1 × 10^9

See Note for meaning of ★

2.5 Covariance and Correlation Between Random Variables

It  is  common  to  measure  more  than  one  random  variable  in  a  given  experiment.  The 
variables  are  often  related  to  one  another  and  it  is  therefore  necessary  to  define  a 
measure  of  how  one  variable  affects  the  measurement  of  the  others.  Consider  the 
case  in  which  we  wish  to  measure  both  the  length  of  one  side  of  a  square  and  the 
area;  it  is  clear  that  the  two  quantities  are  related  in  a  way  that  the  change  of  one 
quantity  affects  the  other  in  the  same  manner,  i.e.,  a  positive  change  of  the  length 
of  the  side  results  in  a  positive  change  of  the  area.  In  this  case,  the  length  and  the 
area  will  be  said  to  have  a  positive  correlation.  In  this  section  we  introduce  the 
mathematical  definition  of  the  degree  of  correlation  between  variables. 


2.5.1  Joint  Distribution  and  Moments  of  Two  Random 
Variables 

When two (or more) variables are measured at the same time via a given experiment,
we are interested in knowing the probability of a given pair of measurements
for the two variables. This information is provided by the joint probability
distribution function, indicated as $h(x,y)$, with the meaning that $h(x,y)\,dx\,dy$ is the
probability that the two variables $X$ and $Y$ are in a two-dimensional interval of size
$dx\,dy$ around the value $(x,y)$. This two-dimensional function can be determined
experimentally via its sample distribution, in the same way as one-dimensional
distributions.

It  is  usually  convenient  to  describe  one  variable  at  a  time,  even  if  the  experiment 
features  more  than  just  one  variable.  In  this  case,  the  expectation  of  each  variable 
(for  example,  X)  is  defined  as 


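$$ E[X] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x\, h(x,y)\, dx\, dy = \mu_x \tag{2.15} $$

and the variance is similarly defined as

$$ E[(X - \mu_x)^2] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_x)^2\, h(x,y)\, dx\, dy = \sigma_x^2. \tag{2.16} $$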

These equations recognize the fact that the other variable, in this case $Y$, is indeed
part of the experiment, but is considered uninteresting for the calculation at hand.
Therefore the uninteresting variable is integrated over, weighted by its probability
distribution function.

The covariance of two random variables is defined as

$$ \mathrm{Cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_x)(y - \mu_y)\, h(x,y)\, dx\, dy = \sigma_{xy}^2. \tag{2.17} $$

The  covariance  is  the  expectation  of  the  product  of  the  deviations  of  the  two 
variables.  Unlike  the  deviation  of  a  single  variable,  whose  expectation  is  always 
zero,  this  quantity  will  be  positive  if,  on  average,  a  positive  deviation  of  X  is 
accompanied  by  a  positive  deviation  of  Y,  or  if  two  negative  deviations  are  likely 
to  occur  simultaneously,  so  that  the  integrand  is  a  positive  quantity.  If,  on  the  other 
hand,  the  two  variables  tend  to  have  deviations  of  opposite  sign,  the  covariance  will 
be  negative.  The  covariance,  like  the  mean  and  variance,  is  a  parent  quantity  that 
can  be  calculated  from  the  theoretical  distribution  of  the  random  variables. 

The sample covariance for a collection of $N$ pairs of measurements is calculated as

$$ s_{xy}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \overline{x})(y_i - \overline{y}), \tag{2.18} $$

using an equation similar to that of the sample variance.

The correlation coefficient $\rho$ is simply a normalized version of the covariance,

$$ \rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_x \sigma_y}. \tag{2.19} $$

The correlation coefficient is a number between $-1$ and $+1$. When the correlation
is zero, the two variables are said to be uncorrelated. The fact that the correlation
coefficient is normalized to within the values $\pm 1$ derives from (2.10) and the
properties of the joint distribution function.

The sample correlation coefficient is naturally defined as

$$ r = \frac{s_{xy}^2}{s_x s_y} \tag{2.20} $$

in which $s_x^2$ and $s_y^2$ are the sample variances of the two variables.

The  covariance  between  two  random  variables  is  very  important  in  evaluating  the 
variance  in  the  sum  (or  any  other  function)  of  two  random  variables,  as  explained  in 
detail  in  Chap.  4.  The  following  examples  illustrate  the  calculation  of  the  covariance 
and  the  sample  covariance. 

Example 2.3 (Variance of Sum of Variables) Consider the random variables $X$, $Y$
and the sum $Z = X + Y$: the variance is given by

$$ \mathrm{Var}(X + Y) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \left(x + y - (\mu_x + \mu_y)\right)^2 h(x,y)\, dx\, dy = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y) $$

which can also be written in the compact form $\sigma_z^2 = \sigma_x^2 + \sigma_y^2 + 2\sigma_{xy}^2$. This shows
that variances add linearly only if the two random variables are uncorrelated. Failure
to check for correlation will result in errors in the calculation of the variance of the
sum of two random variables. ◇

Example 2.4 Consider the measurement of the following pairs of variables: (0, 2),
(2, 5), (1, 4), (−3, −1). We can calculate the sample covariance by means of the
following equation:

$$ s_{xy}^2 = \frac{1}{3} \sum_{i=1}^{4} (x_i - \overline{x})(y_i - \overline{y}) = \frac{17}{3} $$

where $\overline{x} = 0$ and $\overline{y} = 2.5$. Also, the individual variances are calculated as

$$ s_x^2 = \frac{1}{3} \sum_{i=1}^{4} (x_i - \overline{x})^2 = \frac{14}{3}, \qquad s_y^2 = \frac{1}{3} \sum_{i=1}^{4} (y_i - \overline{y})^2 = \frac{21}{3} $$

which results in the sample correlation coefficient between the two random
variables of

$$ r = \frac{17}{\sqrt{14 \times 21}} = 0.99. $$

This is an example of nearly perfect correlation between the two variables.
In fact, positive deviations of one variable from the sample mean are accompanied
by positive deviations of the other by nearly the same amount. ◇
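
The arithmetic of Example 2.4 can be reproduced with a short Python sketch. The function names are hypothetical helpers introduced here; they implement the sample covariance (2.18) and the sample correlation coefficient (2.20) with the $N - 1$ denominator.

```python
import math


def sample_mean(data):
    """Arithmetic mean of a list of measurements."""
    return sum(data) / len(data)


def sample_covariance(xs, ys):
    """Sample covariance with the N - 1 denominator, as in (2.18)."""
    x_bar, y_bar = sample_mean(xs), sample_mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def sample_correlation(xs, ys):
    """Sample correlation coefficient, as in (2.20)."""
    # The covariance of a variable with itself is its sample variance.
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


pairs = [(0, 2), (2, 5), (1, 4), (-3, -1)]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

print("sample covariance:", sample_covariance(xs, ys))
print("sample variances:", sample_covariance(xs, xs), sample_covariance(ys, ys))
print("sample correlation:", round(sample_correlation(xs, ys), 2))
```

Writing the variance as the covariance of a variable with itself keeps the three quantities on the same footing, which mirrors the way (2.18) parallels (2.11).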


2.5.2  Statistical  Independence  of  Random  Variables 

The  independence  between  events  was  described  and  quantified  in  Chap.  1,  where 
it  was  shown  that  two  events  are  independent  only  when  the  probability  of  their 
intersection  is  the  product  of  the  individual  probabilities.  The  concept  is  extended 
here  to  random  variables  by  defining  two  random  variables  as  independent  if  and 
only if the joint probability distribution function can be factored in the following
form:

$$ h(x,y) = f(x) \cdot g(y), \tag{2.21} $$


where $f(x)$ and $g(y)$ are the probability distribution functions of the two random
variables. When two variables are independent, the individual probability
distribution function of each variable is obtained via marginalization of the joint distribution
with respect to the other variable, e.g.,

$$ f(x) = \int_{-\infty}^{+\infty} h(x,y)\, dy. \tag{2.22} $$
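
For a discrete pair of variables the integrals above become sums, and both the marginalization and the factorization test can be sketched in Python. The joint table `h` and the function names below are hypothetical, chosen only to illustrate the definitions.

```python
from fractions import Fraction

# Hypothetical joint distribution h(x, y) for x in {0, 1} and y in {0, 1, 2}.
h = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(1, 5), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 20), (1, 1): Fraction(3, 10), (1, 2): Fraction(3, 20),
}


def marginal(joint, axis):
    """Sum the joint distribution over the other variable.

    axis = 0 returns the distribution of x, axis = 1 that of y.
    """
    out = {}
    for point, prob in joint.items():
        out[point[axis]] = out.get(point[axis], 0) + prob
    return out


def is_independent(joint):
    """Check whether h(x, y) = f(x) g(y) holds for every pair (x, y)."""
    f, g = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), 0) == f[x] * g[y] for x in f for y in g)


f = marginal(h, 0)   # discrete analog of (2.22)
g = marginal(h, 1)
# Mean of X computed from the joint distribution, discrete analog of (2.15).
mu_x = sum(x * prob for (x, _), prob in h.items())

print("f(x):", f)
print("g(y):", g)
print("mu_x:", mu_x)
print("independent:", is_independent(h))
```

The same helper applied to a table in which the two variables always take equal values, for example `{(0, 0): 1/2, (1, 1): 1/2}`, reports that the factorization fails.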


It  is  important  to  remark  that  independence  between  random  variables  and 
uncorrelation  are  not  equivalent  properties.  Independence,  which  is  a  property  of 
the  distribution  functions,  is  a  much  stronger  property  than  uncorrelation,  which  is 
based  on  a  statement  that  involves  only  moments.  It  can  be  proven  that  independence 
implies  uncorrelation,  but  not  vice  versa. 

Proof The fact that independence implies uncorrelation is shown by calculating
the covariance of two independent random variables of joint distribution
function $h(x,y)$. The covariance is

$$ \sigma_{xy}^2 = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_x)(y - \mu_y)\, h(x,y)\, dx\, dy = \int_{-\infty}^{+\infty} (x - \mu_x) f(x)\, dx \int_{-\infty}^{+\infty} (y - \mu_y) g(y)\, dy = 0, $$

since each integral vanishes as the expectation of the deviation of a random
variable. □

As an example of the fact that dependent variables can have zero
correlation, consider the case of a random variable $X$ with a
distribution $f(x)$ that is symmetric around the origin, and another variable
$Y = X^2$. They cannot be independent since they are functionally related, but
it will be shown that their covariance is zero. Symmetry about zero implies
$\mu_x = 0$. The mean of $Y$ is $E[Y] = E[X^2] = \sigma^2$, the variance of $X$, since the mean of $X$ is null.
From this, the covariance is given by

$$ \mathrm{Cov}(X,Y) = E[X(Y - \sigma^2)] = E[X^3 - X\sigma^2] = E[X^3] = 0 $$

due to the symmetry of $f(x)$. Therefore the two variables $X$ and $X^2$ are
uncorrelated, yet they are not independent.
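
This result can be verified exactly for a simple symmetric distribution. In the Python sketch below, the choice of $X$ uniform on the integers from $-2$ to $2$ is an assumption made for illustration, as is the helper name `expect`.

```python
from fractions import Fraction

# Illustrative assumption: X takes the values -2, ..., 2 with equal probability,
# so its distribution is symmetric around the origin; Y = X^2.
xs = [-2, -1, 0, 1, 2]
p = Fraction(1, 5)


def expect(func):
    """Expectation of func(X) for the discrete distribution above."""
    return sum(func(x) * p for x in xs)


mu_x = expect(lambda x: x)       # mean of X, zero by symmetry
mu_y = expect(lambda x: x**2)    # mean of Y = E[X^2], the variance of X
covariance = expect(lambda x: (x - mu_x) * (x**2 - mu_y))

# Independence would require P(X = 2, Y = 4) = P(X = 2) P(Y = 4).
p_joint = p                # X = 2 forces Y = 4
p_product = p * (2 * p)    # P(X = 2) = 1/5 and P(Y = 4) = P(X = -2 or 2) = 2/5

print("covariance:", covariance)
print("P(X=2, Y=4):", p_joint, "P(X=2) P(Y=4):", p_product)
```

The covariance vanishes while the joint probability differs from the product of the marginal probabilities, so the two variables are uncorrelated but not independent.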


Example 2.5 (Photon Counting Experiment) A photon-counting experiment
consists of measuring the total number of photons in a given time interval and the
number of background events detected by the receiver in the same time interval.
The experiment is repeated six times, by measuring simultaneously the total number
of counts $T$ as (10, 13, 11, 8, 10, 14) and the number of background counts $B$ as
(2, 3, 2, 1, 1, 3). We want to estimate the mean number of source photons and its
standard error.

The random variable we seek to measure is $S = T - B$, and the mean and variance
of this random variable can be easily shown to be

$$ \mu_S = \mu_T - \mu_B $$

$$ \sigma_S^2 = \sigma_T^2 + \sigma_B^2 - 2\sigma_{TB}^2 $$

(the derivation is similar to that of Example 2.3). From the data, we measure the
sample means and variances as $\overline{T} = 11.0$, $\overline{B} = 2.0$, $s_T^2 = 4.8$, $s_B^2 = 0.8$ and the
sample covariance as $s_{TB}^2 = +1.8$.

Notice that the correlation coefficient between $T$ and $B$, as estimated via the
measurements, is then given by $\mathrm{corr}(T,B) = 1.8/\sqrt{4.8 \times 0.8} = 0.92$, indicating a
strong degree of correlation between the two measurements. The measurements can
be summarized as

$$ \mu_S = 11.0 - 2.0 = 9.0 $$

$$ \sigma_S^2 = 4.8 + 0.8 - 2 \times 1.8 = 2.0 $$

and be reported as $S = 9.00 \pm 1.41$ counts (per time interval). Notice that if the
correlation between the two measurements had been neglected, then one would
(erroneously) report $S = 9.00 \pm 2.37$, i.e., the standard deviation would be largely
overestimated. The correlation between total counts and background counts in this
example has a significant impact in the calculation of the variance of $S$ and needs to
be taken into account. ◇
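
The sample statistics of Example 2.5 can be recomputed directly from the six pairs of counts. The Python sketch below uses hypothetical helper names and obtains the variance of $S$ in two ways, by propagation with the covariance term and directly from the differences $T - B$, which serves as a cross-check of the propagation formula.

```python
import math

T = [10, 13, 11, 8, 10, 14]   # total counts
B = [2, 3, 2, 1, 1, 3]        # background counts


def sample_mean(data):
    """Arithmetic mean of a list of measurements."""
    return sum(data) / len(data)


def sample_covariance(xs, ys):
    """Sample covariance with the N - 1 denominator."""
    x_bar, y_bar = sample_mean(xs), sample_mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


var_T = sample_covariance(T, T)    # sample variance of T
var_B = sample_covariance(B, B)    # sample variance of B
cov_TB = sample_covariance(T, B)   # sample covariance of T and B

# Variance of S = T - B by propagation, including the covariance term.
var_S_propagated = var_T + var_B - 2 * cov_TB

# Cross-check: variance computed directly from the differences.
S = [t - b for t, b in zip(T, B)]
var_S_direct = sample_covariance(S, S)

# Variance that would be obtained by (incorrectly) ignoring the covariance.
var_S_uncorrelated = var_T + var_B

print("mean of S:", sample_mean(S))
print("var_T, var_B, cov_TB:", var_T, var_B, cov_TB)
print("correlation:", cov_TB / math.sqrt(var_T * var_B))
print("var_S (propagated, direct):", var_S_propagated, var_S_direct)
print("std of S with and without covariance:",
      math.sqrt(var_S_propagated), math.sqrt(var_S_uncorrelated))
```

Because the identity holds exactly for sample statistics computed with the same denominator, the propagated and direct variances must agree up to rounding.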


2.6  A  Classic  Experiment:  Pearson’s  Collection  of  Data 
on  Biometric  Characteristics 

In 1903 K. Pearson published the analysis of a collection of biometric data on
more than 1000 families in the United Kingdom, with the goal of establishing
how certain characters, such as height, are correlated and inherited [33].
Prof. Pearson is also the inventor of the $\chi^2$ test and a central figure in the
development of the modern science of statistics.

Pearson asked a number of families, composed of at least the father,
mother, and one son or daughter, to perform measurements of height, span of
arms and length of left forearm. This collection of data resulted in a number
of tables, including some for which Pearson provides the distribution of two
measurements at a time. One such table is that reporting the mother’s height
versus the father’s height, Table 2.3.

The  data  reported  in  Table  2.3  represent  the  joint  probability  distribution  of 
the  two  physical  characters,  binned  in  one-inch  intervals.  When  a  non-integer 
count  is  reported  (e.g.,  a  value  of  0.25,  0.5  or  0.75),  we  interpret  it  as  meaning 
that  the  original  measurement  fell  exactly  at  the  boundary  between  two  cells, 
although  Pearson  does  not  provide  an  explanation  for  non-integer  values. 

The sum of all counts is also reported for every column and row. The
bottom row in the table is therefore the distribution of the father’s height,
irrespective of the mother’s height; likewise, the rightmost column is the
distribution of the mother’s height, regardless of the father’s height. The
process of obtaining a one-dimensional distribution from a multi-dimensional
one illustrates the marginalization over certain variables that are not of interest.
In the case of the bottom row, the marginalization of the distribution was
done over the mother’s height, to obtain the distribution of the father’s height.

From Table 2.3 it is not possible to determine whether there is a correlation
between father’s and mother’s heights. In fact, according to (2.18), we would
need all 1079 pairs of height measurements originally collected by Pearson
to calculate the covariance. Since Pearson did not report these raw (i.e.,
unprocessed) data, we cannot calculate either the covariance or the correlation
coefficient. The measurements reported by Pearson are in a format that goes
under the name of contingency table, consisting of a table with measurements
that are binned into suitable two-dimensional intervals.


Summary  of  Key  Concepts  for  this  Chapter 

□ Random variable: A quantity that is not known exactly and is described
by a probability distribution function $f(x)$.

□ Moments of a distribution: Expectations for the random variable or
functions of the random variable, such as the mean $\mu = E[X]$ and the
variance $\sigma^2 = E[(X - \mu)^2]$.

□ Sample mean and sample variance: Quantities calculated from the
measurements that are intended to approximate the corresponding parent
quantities (mean and variance).

□ Joint distribution function: The distribution of probabilities for a pair of
variables.

□ Covariance: A measure of the tendency of two variables to follow one
another, $\mathrm{Cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)]$.

□ Correlation coefficient: A normalized version of the covariance that takes
values between $-1$ (perfect anti-correlation) and $+1$ (perfect correlation).

□ Statistically independent variables: Two variables whose joint probability
distribution function can be factored as $h(x,y) = f(x)g(y)$.

Problems


2.1 Consider the exponential distribution

$$ f(x) = \lambda e^{-\lambda x} $$

where $\lambda > 0$ and $x \geq 0$. Show that the distribution is properly normalized, and
calculate the mean, variance and cumulative distribution $F(x)$.

2.2 Consider the sample mean as a random variable defined by

$$ \overline{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{2.23} $$

where $x_i$ are identical independent random variables with mean $\mu$ and variance $\sigma^2$.
Show that the variance of $\overline{x}$ is equal to $\sigma^2/N$.

2.3 J.J. Thomson’s experiment aimed at the measurement of the ratio between the
mass and charge of the electron is presented in Sect. 2.4. Using the datasets for Tube
1 and Tube 2 separately, calculate the mean and variance of the random variables
$W/Q$ and $I$, and the covariance and correlation coefficient between $W/Q$ and $I$.

2.4 Using J.J. Thomson’s experiment (Sect. 2.4), verify the statement that “It will
be seen from these tables that the value of m/e is independent of the nature of
the gas” used in the experiment. You may do so by calculating the mean and
standard deviation for the measurements in each gas (air, hydrogen, and carbonic
acid) and testing whether the three measurements agree with each other within their
standard deviations.

2.5  Calculate  the  sample  covariance  and  correlation  coefficient  for  the  following 
set  of  data:  (0, 2),  (2,  5),  (1, 4),  (3, 1). 

2.6 Prove that the following relationship holds,

$$ \mathrm{Var}(X) = E[X^2] - \mu^2 $$

where $\mu$ is the mean of the random variable $X$.


Chapter  3 

Three  Fundamental  Distributions:  Binomial, 
Gaussian,  and  Poisson 


Abstract There are three distributions that play a fundamental role in statistics. The
binomial distribution describes the number of positive outcomes in binary
experiments, and it is the “mother” distribution from which the other two distributions
can be obtained. The Gaussian distribution can be considered as a special case of
the binomial, when the number of tries is sufficiently large. For this reason, the
Gaussian distribution applies to a large number of variables, and it is referred to as
the normal distribution. The Poisson distribution applies to counting experiments,
and it can be obtained as the limit of the binomial distribution when the probability
of success is small.


3.1  The  Binomial  Distribution 

Many experiments can be considered as binary, meaning that they can only have
two possible outcomes, which we can interpret as success or failure. Even complex
experiments with a larger number of possible outcomes can be described as binary,
when one is simply interested in the occurrence of a specific event $A$, or its
non-occurrence, $\overline{A}$. It is therefore of fundamental importance in statistics to determine
the properties of binary experiments, and the distribution of the number of successes
when the experiment is repeated a number of times under the same experimental
conditions.


3.1.1  Derivation  of  the  Binomial  Distribution 

Consider a binary experiment characterized by a probability of success $p$ and therefore
a probability of failure $q = 1 - p$. The probabilities $p$ and $q$ are determined according
to the theory of probability and are assumed to be known for the experiment being
considered. We seek the probability of having $n$ successes in $N$ tries, regardless of
the order in which the successes take place. For example, consider tossing four
coins, and being interested in any two of these coins showing heads up, as an
indication of success of the toss. To obtain this probability, we start by counting
how many outcomes of the experiment are possible, and break down the
derivation into three parts:

• Probability of a specific sequence of $n$ successes out of $N$ tries. Assume that
successive experiments are independent, e.g., one tosses the same coin many
times, each time independently of each other. The probability of having $n$
successes and therefore $N - n$ failures occurring in a specific sequence is given by

$$ P(\text{specific sequence of } n \text{ successes}) = p^n \times q^{N-n}. \tag{3.1} $$

This result can be seen by using the property of independence among the $N$
events, of which the $n$ successes carry a probability $p$, and the $(N - n)$ failures a
probability $q$.

Example 3.1 Considering the case of four coin tosses, the probability of a given
sequence, for example “heads-tails-tails-heads,” is $(1/2) \times (1/2) \times (1/2) \times (1/2) = (1/2)^4$,
since $p = q = 1/2$. Successive tosses are assumed to be independent. ◇

• Number of ordered sequences. We start by counting how many ordered sequences
exist that have $n$ successes out of $N$ tries. If there are no successes ($n = 0$), then
there is only one possible sequence with $N$ failures. If $n > 0$, each of the $N$ tries
can yield the “first” success, and therefore there are $N$ possibilities for which try
is the first success. Continuing on to the “second” success, there are only $N - 1$
possibilities left for which try will be the second success, and so on. This leads to
the following number of sequences containing $n$ time-ordered successes, that is,
sequences for which we keep track of the order in which the successes occurred:

$$ \mathrm{Perm}(n, N) = N \cdot (N - 1) \cdots (N - n + 1) = \frac{N!}{(N - n)!}. \tag{3.2} $$

This is called the number of permutations of $n$ successes out of $N$ tries. This
method of counting sequences can also be imagined as the placement of each
success in a “success box”: the first place in this box can be filled in $N$ different
ways, the second in $(N - 1)$ ways corresponding to the remaining tries, and so
on.

Example 3.2 Consider the case of $n = 2$ successes out of $N = 4$ trials.
According to (3.2), the number of permutations is $4!/2! = 12$. We list explicitly
all 12 ordered sequences that give rise to 2 successes out of 4 tries in Table 3.1.
Symbol $H_1$ denotes the “first success,” and $H_2$ the “second success.” Consider,
for example, lines 5 and 8: both represent the same situation in which coins 2
and 3 showed heads up, or success, and they are not really different sequences,
but the separate entries in this table are the result of our method of counting
time-ordered sequences. ◇

Table 3.1 Permutations (ordered sequences) of 2 successes out of 4 tries

            Number of try                   Number of try
Sequence    1    2    3    4    Sequence    1    2    3    4
1           H1   H2   -    -    7           H2   -    H1   -
2           H1   -    H2   -    8           -    H2   H1   -
3           H1   -    -    H2   9           -    -    H1   H2
4           H2   H1   -    -    10          H2   -    -    H1
5           -    H1   H2   -    11          -    H2   -    H1
6           -    H1   -    H2   12          -    -    H2   H1

In reality, we are not interested in the time order in which the $n$ successes
occur, since it is of no consequence whether the first or the $N$th, or any other, try
is the “first” success. We must therefore correct for this artifact in the following.

• Number of sequences of $n$ successes out of $N$ tries (regardless of order). As is
clear from the previous example, the number of permutations is not quite
the number we seek, since it is of no consequence which success happened
first. According to (3.2), there are $n!$ ways of ordering $n$ successes among
themselves, or $\mathrm{Perm}(n, n) = n!$. Since all $n!$ permutations give rise to the
same practical situation of $n$ successes, we need to divide the number of
(time-ordered) permutations by $n!$ in order to avoid double-counting of permutations
with successes in the same trial number. It is therefore clear that, regardless of
time order, the number of combinations of $n$ successes out of $N$ trials is

$$ C(n, N) = \frac{\mathrm{Perm}(n, N)}{n!} = \frac{N!}{(N - n)!\, n!}. \tag{3.3} $$

The number of combinations is the number we seek, i.e., the number of possible
sequences of $n$ successes in $N$ tries.

Example 3.3 Continue to consider the case of 2 successes out of 4 trials. There are
$2! = 2$ ways to order the 2 successes among themselves (either one or the other
is the first success). Therefore the number of combinations of 2 successes out of
4 trials is 6, and not 12. As indicated above, in fact, each sequence had its “twin”
sequence listed separately, and (3.3) correctly counts only different sequences. ◇
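
The counting argument of Examples 3.2 and 3.3 can be checked by brute-force enumeration. The following Python sketch relies on the standard-library functions `itertools.permutations`, `itertools.combinations`, `math.perm` and `math.comb`; the variable names are chosen here for illustration.

```python
import itertools
import math

N, n = 4, 2                  # 2 successes out of 4 tries
tries = range(1, N + 1)      # tries are labeled 1, ..., N

# Time-ordered placements: each tuple is (try of first success, try of second success).
ordered = list(itertools.permutations(tries, n))

# Placements regardless of order: each tuple is the set of tries that succeeded.
unordered = list(itertools.combinations(tries, n))

print(len(ordered), math.perm(N, n))     # number of permutations, as in (3.2)
print(len(unordered), math.comb(N, n))   # number of combinations, as in (3.3)

# Each unordered placement corresponds to n! time-ordered ones.
print(math.perm(N, n) == math.factorial(n) * math.comb(N, n))
```

For instance, the ordered tuples `(2, 3)` and `(3, 2)` both correspond to the single unordered placement `(2, 3)`, which is the pair of "twin" sequences discussed in the examples.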

According to the results obtained above, what remains to be done is to use the
probability of each sequence (3.1) and multiply it by the number of combinations in
(3.3) to obtain the overall probability of having $n$ successes in $N$ trials:

$$ P(n) = \binom{N}{n} p^n q^{N-n}, \qquad n = 0, \ldots, N. \tag{3.4} $$

This distribution is known as the binomial distribution and it describes the
probability of $n$ successes in $N$ tries of a binary experiment. It is a discrete distribution that
is defined for non-negative values $n \leq N$. The factor in (3.3) is in fact the binomial
coefficient and it derives its name from its use in the binomial expansion

$$ (p + q)^N = \sum_{n=0}^{N} \binom{N}{n} p^n q^{N-n}. \tag{3.5} $$
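
A minimal Python sketch of the binomial distribution follows, built as the product of the number of combinations and the probability of a specific sequence. The function name `binomial_pmf` is a hypothetical helper; `math.comb` is the standard-library binomial coefficient.

```python
import math


def binomial_pmf(n, N, p):
    """Probability of n successes in N independent tries.

    Product of the number of combinations (3.3) and the probability
    of one specific sequence (3.1), with q = 1 - p.
    """
    q = 1 - p
    return math.comb(N, n) * p**n * q ** (N - n)


# Four coin tosses with p = q = 1/2.
four_coins = [binomial_pmf(n, 4, 0.5) for n in range(5)]
print(four_coins)        # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(four_coins))   # 1.0, as required by the binomial expansion (3.5)
```

The probability of exactly two heads in four tosses is 6 combinations times $(1/2)^4$, or 0.375.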


3.1.2  Moments  of  the  Binomial  Distribution 

The moment of $m$th order for a discrete random variable $X$ of distribution $P(n)$ is
given by

$$ E[X^m] = \overline{n^m} = \sum_{n=0}^{N} n^m P(n). \tag{3.6} $$


We can show that the mean and the second moment of the binomial distribution
are given by

$$ \begin{aligned} \overline{n} &= pN \\ \overline{n^2} &= \overline{n}^2 + pqN. \end{aligned} \tag{3.7} $$

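Proof Start with the mean,

$$ \overline{n} = \sum_{n=0}^{N} P(n)\, n = \sum_{n=0}^{N} \binom{N}{n} n\, p^n q^{N-n} = \sum_{n=0}^{N} \binom{N}{n} \left( p \frac{\partial}{\partial p} \right) p^n q^{N-n}, $$

in which we have introduced a linear operator $p\, \partial/\partial p$ that can be conveniently applied to the entire sum,

$$ \overline{n} = p \frac{\partial}{\partial p} \left[ \sum_{n=0}^{N} \binom{N}{n} p^n q^{N-n} \right] = p \frac{\partial}{\partial p} (p + q)^N = pN (p + q)^{N-1} = pN. $$

Here q is held fixed while differentiating with respect to p, and p + q = 1 is substituted only at the end.

The derivation for the moment $\overline{n^2}$ is similar:

$$ \begin{aligned}
\overline{n^2} &= \sum_{n=0}^{N} P(n)\, n^2 = \sum_{n=0}^{N} \binom{N}{n} n^2 p^n q^{N-n} = \left( p \frac{\partial}{\partial p} \right)^2 (p + q)^N = p \frac{\partial}{\partial p} \left[ pN (p + q)^{N-1} \right] \\
&= p \left[ N (p + q)^{N-1} + pN(N-1)(p + q)^{N-2} \right] = pN + p^2 N(N-1) = pN + (pN)^2 - p^2 N \\
&= \overline{n}^2 + p(1 - p)N = \overline{n}^2 + pqN.
\end{aligned} $$

□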

It follows that the variance of the binomial distribution is given by

$$ \sigma^2 = E[(X - \overline{n})^2] = pqN. \tag{3.8} $$

Equations (3.7) and (3.8) describe the most important features of the binomial distribution, shown in Fig. 3.1 for the case of N = 10. The mean is naturally given by the product of the number of tries N and the probability of success p in each of the tries. The standard deviation $\sigma$ measures the root mean square of the deviation from the mean and it is a measure of the width of the distribution.
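The results (3.7) and (3.8) can also be verified numerically from the definition (3.6). The following sketch is only a check under illustrative assumptions (Python, N = 10 and p = 0.5 as in Fig. 3.1, and a helper name `binomial_moment` of our own choosing).

```python
from math import comb

def binomial_moment(m, N, p):
    """Moment of order m of the binomial distribution, from Eq. (3.6)."""
    q = 1.0 - p
    return sum(n**m * comb(N, n) * p**n * q**(N - n) for n in range(N + 1))

N, p = 10, 0.5
mean = binomial_moment(1, N, p)
second = binomial_moment(2, N, p)
variance = second - mean**2

# Compare with pN = 5 and pqN = 2.5
print(mean, variance)
```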

Fig. 3.1 Binomial distribution with p = q = 0.5 and N = 10 (horizontal axis: random variable X). The dotted lines around the mean mark the $\pm\sigma$ range

Example 3.4 (Probability of Overbooking) An airline knows that 5 % of the persons making reservations will not show up at the gate. On a given flight that can seat 50 people, 52 tickets have been sold. Calculate the probability that there will be a seat available for every passenger that will arrive at the gate.

This is a binary experiment in which p = 0.95 is the probability that a passenger will show. For that specific flight, N = 52 passengers have the choice of showing (or not). The probability that there is a seat available for each passenger is therefore given by $P = 1 - (P_N(52) + P_N(51))$, which is calculated as

$$ P = 1 - \left[ (0.95)^{52} + 52 \cdot (0.95)^{51} \cdot 0.05 \right] = 0.741. $$

Therefore the airline is willing to take a 25.9 % chance of having an overbooked flight. ◇
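The arithmetic of Example 3.4 can be reproduced with a few lines of code. This is a sketch in Python under the assumptions of the example (p = 0.95, N = 52, 50 seats); the variable names are ours.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of n successes in N tries, Eq. (3.4)."""
    return comb(N, n) * p**n * (1.0 - p)**(N - n)

N, p, seats = 52, 0.95, 50

# The flight is overbooked if 51 or 52 passengers show up
p_overbooked = sum(binomial_pmf(n, N, p) for n in range(seats + 1, N + 1))
p_seat_for_all = 1.0 - p_overbooked

# Compare with the values 0.741 and 25.9 % quoted in the example
print(f"{p_seat_for_all:.4f} {100 * p_overbooked:.1f} %")
```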


3.2  The  Gaussian  Distribution 

The  Gaussian  distribution,  often  referred  to  as  the  normal  distribution,  can  be 
considered  as  a  special  case  of  the  binomial  distribution  in  the  case  of  a  large  number 
N  of  experiments  performed.  In  this  section  we  derive  the  Gaussian  distribution  from 
the  binomial  distribution  and  describe  the  salient  features  of  the  distribution. 


3.2.1  Derivation  of  the  Gaussian  Distribution 
from  the  Binomial  Distribution 

The  binomial  distribution  of  (3.4)  acquires  a  simpler  form  when  N  is  large.  An 
alternative  analytic  expression  to  the  binomial  distribution  is  a  great  advantage, 
given  the  numerical  difficulties  associated  with  the  evaluation  of  the  factorial 
of  large  numbers.  As  was  evident  from  Fig.  3.1,  the  binomial  distribution  has  a 
maximum at value n = Np. In the following we prove that the binomial distribution can be approximated as

$$ P(n) \simeq \frac{1}{\sqrt{2\pi Npq}}\, e^{-\frac{(n - Np)^2}{2Npq}} \tag{3.9} $$

in the case in which $N \gg 1$, and for values of the variable that are close to the peak of the distribution.

Proof Expand the logarithm of the binomial probability as a Taylor series in the neighborhood of the peak value $\tilde{n}$,

$$ \ln P(n) = \ln P(\tilde{n}) + \sum_{k=1}^{\infty} \frac{B_k}{k!} \Delta n^k $$

where $\Delta n = n - \tilde{n}$ is the deviation from the peak value and

$$ B_k = \left. \frac{\partial^k \ln P(n)}{\partial n^k} \right|_{n=\tilde{n}} $$

is the coefficient of the Taylor series expansion. Since, by assumption, $\tilde{n}$ is a point of maximum, the first coefficient is null, $\partial \ln P(n)/\partial n|_{n=\tilde{n}} = 0$. We neglect terms of order $O(\Delta n^3)$ in the expansion, and the approximation results in

$$ \ln P(n) \simeq \ln P(\tilde{n}) + \frac{1}{2} B_2 \Delta n^2, $$

where $B_2$ is negative, since $n = \tilde{n}$ is a point of maximum. It follows that

$$ P(n) \simeq P(\tilde{n})\, e^{-\frac{|B_2| \Delta n^2}{2}}. $$

Neglecting higher-order terms in $\Delta n$ means that the approximation will be particularly accurate in regions where $\Delta n$ is small, i.e., near the peak of the distribution. Away from the peak, the approximation will not hold with the same precision.

In the following we show that the unknown terms can be calculated as

$$ B_2 = -\frac{1}{Npq}, \qquad P(\tilde{n}) = \frac{1}{\sqrt{2\pi Npq}}. $$

First, we calculate the value of $|B_2|$. Start with

$$ \ln P(n) = \ln \left( \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n} \right) = \ln N! - \ln n! - \ln (N-n)! + n \ln p + (N-n) \ln q. $$


At this point we need to start treating n as a continuous variable. This approximation is acceptable when $Np \gg 1$, so that values n of the random variable near the peak of the distribution are large numbers. In this case, we can approximate the derivative of the logarithm with a difference,

$$ \frac{\partial \ln n!}{\partial n} \simeq \frac{\ln (n+1)! - \ln n!}{1} = \ln (n+1) \simeq \ln n. $$

From this it follows that the first derivative of the probability function, as expected, is zero at the peak value,

$$ \left. \frac{\partial \ln P(n)}{\partial n} \right|_{n=\tilde{n}} = \left. \left( -\ln n + \ln (N-n) + \ln p - \ln q \right) \right|_{n=\tilde{n}} = \ln \left( \frac{N - \tilde{n}}{\tilde{n}}\, \frac{p}{q} \right) = 0 $$

so that the familiar result of $\tilde{n} = p \cdot N$ is obtained. This leads to the calculation of the second derivative,


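$$ \begin{aligned}
B_2 &= \left. \frac{\partial^2 \ln P(n)}{\partial n^2} \right|_{n=\tilde{n}} = \left. \frac{\partial}{\partial n} \ln \left( \frac{N-n}{n}\, \frac{p}{q} \right) \right|_{n=\tilde{n}} = \left. \frac{\partial}{\partial n} \left( \ln (N-n) - \ln n \right) \right|_{n=\tilde{n}} \\
&= \left. \left( -\frac{1}{N-n} - \frac{1}{n} \right) \right|_{n=\tilde{n}} = -\frac{1}{N - \tilde{n}} - \frac{1}{\tilde{n}} = -\frac{1}{N(1-p)} - \frac{1}{Np} = -\frac{p+q}{Npq} = -\frac{1}{Npq}.
\end{aligned} $$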

Finally, the normalization constant $P(\tilde{n})$ can be calculated making use of the integral

$$ \int_{-\infty}^{\infty} e^{-a x^2}\, dx = \sqrt{\pi / a}. $$

Enforcing the normalization condition of the probability distribution function,

$$ \int_{-\infty}^{\infty} P(\tilde{n})\, e^{-\frac{|B_2| \Delta n^2}{2}}\, d\Delta n = P(\tilde{n}) \sqrt{\frac{2\pi}{|B_2|}} = 1 $$

we find that $P(\tilde{n}) = 1/\sqrt{2\pi Npq}$. We are therefore now in a position to obtain an approximation to the binomial distribution, valid when $n \gg 1$:


$$ P(n) = \frac{1}{\sqrt{2\pi Npq}}\, e^{-\frac{(n - Np)^2}{2Npq}}. $$

□

Using the fact that the mean of the distribution is $\mu = Np$, and that the variance is $\sigma^2 = Npq$, the approximation takes the form

$$ P(n) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(n - \mu)^2}{2\sigma^2}} \tag{3.10} $$

which is the standard form of the Gaussian distribution, in which n is a continuous variable. Equation (3.10) is read as P(n) being the probability of occurrence of the value n for a given random variable of mean $\mu$ and variance $\sigma^2$. The Gaussian distribution has the familiar “bell” shape, as shown in Fig. 3.2. When n becomes a continuous variable, which we will call x, we talk about the probability of occurrence of the variable in a given range x, x + dx. The Gaussian probability distribution function is thus written as

$$ f(x)\, dx = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dx. \tag{3.11} $$

A Gaussian of mean $\mu$ and variance $\sigma^2$ is often referred to as $N(\mu, \sigma)$. The standard Gaussian is one with zero mean and unit variance, indicated by N(0, 1).
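The quality of the Gaussian approximation to the binomial distribution can be inspected numerically. The sketch below, written in Python with illustrative values N = 100 and p = 0.5 and with function names of our own choosing, compares the exact binomial probability with the Gaussian of mean Np and variance Npq at a few values of n around the peak.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(n, N, p):
    """Exact binomial probability, Eq. (3.4)."""
    return comb(N, n) * p**n * (1.0 - p)**(N - n)

def gaussian(x, mu, sigma2):
    """Gaussian of mean mu and variance sigma2, Eq. (3.10)."""
    return exp(-(x - mu)**2 / (2.0 * sigma2)) / sqrt(2.0 * pi * sigma2)

N, p = 100, 0.5
mu, sigma2 = N * p, N * p * (1.0 - p)

for n in (40, 45, 50, 55, 60):
    b, g = binomial_pmf(n, N, p), gaussian(n, mu, sigma2)
    print(n, f"{b:.5f}", f"{g:.5f}", f"{(g - b) / b:+.4f}")
```

The last column is the fractional difference between the two functions, which is smallest near the peak, as expected from the Taylor expansion used in the proof.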


Fig. 3.2 Gaussian distribution with $\mu = 50$ and $\sigma^2 = 12.5$, corresponding to a binomial distribution of p = q = 0.5, and N = 100. The Gaussian distribution is symmetrical around the mean and therefore the mean, mode, and median coincide. The dotted lines around the mean mark the $\pm\sigma$ range

3.2.2 Moments and Properties of the Gaussian Distribution

The parameters $\mu$ and $\sigma^2$ are, respectively, the mean and variance of the Gaussian distribution. These results follow from the derivation of the Gaussian distribution from the binomial, and can be confirmed by direct calculation of expectations from (3.11). Central moments of odd order are zero, since the Gaussian is symmetric with respect to the mean.

Given its wide use in statistics, it is important to quantify the “effective width” of the Gaussian distribution around its mean. The half width at half maximum, or HWHM, is defined as the range of x between the peak and the point where $P(x) = 0.5\, P(\mu)$. It can be easily shown that the HWHM has a size of approximately $1.18\sigma$, meaning that the half-maximum point is just past one standard deviation from the mean. By the same token, the full width at half maximum, or FWHM, is defined as the range between the two points where $P(x) = 0.5\, P(\mu)$. It is twice the HWHM, or $2.35\sigma$ in size. Tables of the Gaussian distribution are provided in Appendix A.1.
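The size of the HWHM follows from setting the exponential factor of (3.11) equal to 1/2, which gives $x - \mu = \sigma\sqrt{2 \ln 2}$. A short numerical check, sketched here in Python for a Gaussian with $\sigma = 1$ (an illustrative choice), is:

```python
from math import exp, log, sqrt

sigma = 1.0
hwhm = sqrt(2.0 * log(2.0)) * sigma  # half width at half maximum
fwhm = 2.0 * hwhm                    # full width at half maximum

# Ratio f(mu + HWHM) / f(mu) for the Gaussian of Eq. (3.11)
ratio = exp(-hwhm**2 / (2.0 * sigma**2))

print(f"{hwhm:.4f} {fwhm:.4f} {ratio:.4f}")
```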

The range between the points $x = \mu \pm \sigma$ is a common measure of the effective range of the random variable. The probability of a Gaussian variable to be in the range from $\mu - \sigma$ to $\mu + \sigma$ is calculated as the integral of the probability distribution function between those limits. In general, we define the integral

$$ A(z) = \int_{\mu - z\sigma}^{\mu + z\sigma} f(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-z}^{z} e^{-\frac{x^2}{2}}\, dx \tag{3.12} $$

where f(x) is the Gaussian distribution; this integral is related to the error function,

$$ \mathrm{erf}\, z = \frac{1}{\sqrt{\pi}} \int_{-z}^{z} e^{-x^2}\, dx. \tag{3.13} $$

The function A(z) is tabulated in Appendix A.1. The probability of the variable to be within one $\sigma$ of the mean is A(1) = 0.683, or 68.3 %. The range of x between $\mu - \sigma$ and $\mu + \sigma$ therefore corresponds to a 68.3 % interval of probability, and it is referred to as the $1\sigma$ interval. The correspondence between the $1\sigma$ interval and the 68.3 % confidence interval applies strictly only to the Gaussian distribution, for which the value of $\sigma$ is defined via the distribution function. It is common practice, however, to calculate the 68.3 % interval (sometimes shortened to 68 %) even for those random variables that do not strictly follow a Gaussian distribution, and refer to it as the $1\sigma$ interval. The probability associated with characteristic intervals of a Gaussian variable is also reported in Table 3.2.
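With the change of variable $x = \sqrt{2}\, u$ in (3.12), the integral A(z) can be written in terms of the error function (3.13) as $A(z) = \mathrm{erf}(z/\sqrt{2})$. The sketch below uses this relation with the `erf` function of the Python standard library to evaluate the probabilities of the characteristic intervals; the function name `A` simply mirrors the notation of the text.

```python
from math import erf, sqrt

def A(z):
    """Probability that a Gaussian variable lies within z sigma of the mean."""
    return erf(z / sqrt(2.0))

for z in (1, 2, 3, 4):
    print(z, f"{A(z):.4f}")
```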

Table 3.2 Probability associated with characteristic intervals of a Gaussian distribution

Interval                                                  Integrated probability
$\mu - \sigma$, $\mu + \sigma$ ($1\sigma$ interval)       0.6827, or 68.27 %
$\mu - 2\sigma$, $\mu + 2\sigma$ ($2\sigma$ interval)     0.9545, or 95.45 %
$\mu - 3\sigma$, $\mu + 3\sigma$ ($3\sigma$ interval)     0.9973, or 99.73 %
$\mu - 4\sigma$, $\mu + 4\sigma$ ($4\sigma$ interval)     0.9999, or 99.99 %

The cumulative distribution of a Gaussian random variable N(0, 1) is defined by the following integral:

$$ B(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx; \tag{3.14} $$

the integral can be calculated as B(z) = 1/2 + A(z)/2 for z > 0 and it is tabulated in Table A.3. For z < 0, the table can be used to calculate the cumulative distribution as B(z) = 1 - B(-z).
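The cumulative distribution of the standard Gaussian can also be computed from the error function instead of being read from a table. The following Python sketch assumes the relation $B(z) = 1/2 + \mathrm{erf}(z/\sqrt{2})/2$, which holds for both signs of z because the error function is odd; the function name `B` mirrors the notation of the text.

```python
from math import erf, sqrt

def B(z):
    """Cumulative distribution of N(0, 1), valid for any real z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(f"{B(1.0):.4f} {B(-1.0):.4f} {1.0 - B(1.0):.4f}")
```

The last two printed numbers coincide, in agreement with the symmetry relation used for negative z.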


3.2.3  How  to  Generate  a  Gaussian  Distribution 
from  a  Standard  Normal 

All Gaussian distributions can be obtained from the standard N(0, 1) via a simple change of variable. If X is a random variable distributed like $N(\mu, \sigma)$ and Z a standard Gaussian N(0, 1), then the relationship between Z and X is given by

$$ Z = \frac{X - \mu}{\sigma}. \tag{3.15} $$

The variable Z is also referred to as the z-score associated with the variable X. This equation means that if we can generate samples from a standard normal, we can also have samples from any other Gaussian distribution. If we call z a sample from Z, then

$$ x = \sigma \cdot z + \mu \tag{3.16} $$

will be a sample drawn from X, according to the equation above. Many programming languages have a built-in function to generate samples from a standard normal, and this simple process can be used to generate samples from any other Gaussian. A more general procedure to generate a given distribution from a uniform distribution will be presented in Sect. 4.8.
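As an example of this procedure, the sketch below draws samples from a standard normal with the `random.gauss` function of the Python standard library and rescales them according to (3.16). The target parameters $\mu = 50$ and $\sigma = 5$, the sample size, and the seed are arbitrary choices made for the illustration.

```python
import random
from statistics import mean, stdev

random.seed(1)          # fixed seed, so that the sketch is reproducible
mu, sigma = 50.0, 5.0   # illustrative parameters of the target Gaussian

# Samples z from the standard normal N(0, 1)
z_samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Change of variable of Eq. (3.16): x = sigma * z + mu
x_samples = [sigma * z + mu for z in z_samples]

# Sample mean and standard deviation should be close to mu and sigma
print(mean(x_samples), stdev(x_samples))
```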


3.3  The  Poisson  Distribution 

The Poisson distribution describes the probability of occurrence of events in counting experiments, i.e., when the possible outcome is an integer number describing how many counts have been recorded. The distribution is therefore discrete and can be derived as a limiting case of the binomial distribution.

3.3.1 Derivation of the Poisson Distribution

The binomial distribution has another useful approximation in the case in which $p \ll 1$, or when the probability of success is small. In this case, the number of positive outcomes is much smaller than the number of tries, $n \ll N$, and the factorial function can be approximated as

$$ N! = N(N-1) \cdots (N-n+1) \cdot (N-n)! \simeq N^n (N-n)!. $$

We are also interested in finding an approximation for the $q^{N-n}$ term that appears in the binomial. For this we set

$$ \ln q^{N-n} = \ln (1-p)^{N-n} = (N-n) \ln (1-p) \simeq -p(N-n) \simeq -pN, $$

and therefore we obtain the approximation

$$ q^{N-n} \simeq e^{-pN}. $$

These two approximations can be used in (3.4) to give

$$ P(n) \simeq \frac{N^n (N-n)!}{n!\,(N-n)!}\, p^n e^{-pN} = \frac{(pN)^n}{n!}\, e^{-pN}. \tag{3.17} $$


Since pN is the mean of the distribution, we can rewrite our approximation as

$$ P(n) = \frac{\mu^n}{n!}\, e^{-\mu}, \tag{3.18} $$

known as the Poisson distribution. This function describes the probability of obtaining n positive outcomes, or counts, when the expected number of outcomes is $\mu$. It can be immediately seen that the distribution is properly normalized, since

$$ \sum_{n=0}^{\infty} \frac{\mu^n}{n!} = e^{\mu}. $$
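The accuracy of the approximations that lead to (3.18) can be checked by comparing the Poisson distribution with a binomial distribution of small p. The following Python sketch assumes the illustrative values N = 1000 and p = 0.002, so that $\mu = 2$; the function names are ours.

```python
from math import comb, exp, factorial

def poisson_pmf(n, mu):
    """Poisson probability of n counts for a mean mu, Eq. (3.18)."""
    return mu**n / factorial(n) * exp(-mu)

def binomial_pmf(n, N, p):
    """Exact binomial probability, Eq. (3.4)."""
    return comb(N, n) * p**n * (1.0 - p)**(N - n)

N, p = 1000, 0.002
mu = N * p

# Normalization: terms beyond n = 60 are negligible for mu = 2
total = sum(poisson_pmf(n, mu) for n in range(60))
print(total)

for n in range(6):
    print(n, f"{binomial_pmf(n, N, p):.5f}", f"{poisson_pmf(n, mu):.5f}")
```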

A fundamental feature of this distribution is that it is described by only one parameter, the mean $\mu$, as opposed to the Gaussian distribution that had two parameters. This clearly does not mean that the Poisson distribution has no variance (in that case, it would not be a random variable!) but that the variance can be written as a function of the mean, as will be shown in the following.

3.3.2 Properties and Interpretation of the Poisson Distribution

The approximations used in the derivation of (3.18) caused the loss of any reference to the initial binomial experiment, and only the mean $\mu = Np$ is present. Using the definition of mean and variance, it is easy to prove that the mean is indeed $\mu$, and that the variance is also equal to the mean, $\mathrm{Var}(n) = \sigma^2 = \mu$. The fact that the mean equals the variance can be seen using the values for the binomial, $\mu = Np$ and $\sigma^2 = Npq$; since $p \ll 1$, $q \simeq 1$, and $\mu \simeq \sigma^2$. As a result, the Poisson distribution has only one parameter.

The Poisson distribution can be interpreted as the probability of occurrence of n events in the case of an experiment that detects individual counts, when the mean of the counts is $\mu$. This makes the Poisson distribution the primary statistical tool for all experiments that can be expressed in terms of the counting of a specific variable associated with the experiment. Typical examples are the counting of photons or the counting of plants with a given characteristic, etc. When an experiment can be cast in terms of a counting experiment, even without a specific reference to an underlying binary experiment, then the Poisson distribution will apply. All reference to the total number of possible events (N) and the probability of occurrence of each event (p) was lost because of the approximation used throughout, i.e., $p \ll 1$, and only the mean $\mu$ remains to describe the primary property of the counting experiment, which is the mean or expectation for the number of counts.

As can be seen in Fig. 3.3, the Poisson distribution is not symmetric with respect to the mean, and the distribution becomes more symmetric for larger values of the mean. As for all discrete distributions, it is only meaningful to calculate the probability at a specific point or for a set of points, and not for an interval of points as in the case of continuous distributions. Moreover, the mean of the distribution itself can be a non-integer number, and still the outcome of the experiment described by the Poisson distribution can only take integer values.

Example 3.5 Consider an astronomical source known to produce photons, which are usually detected by a given detector in the amount of $\mu = 2.5$ in a given time interval. The probability of detecting n = 4 photons in a given time interval is therefore

$$ P(4) = \frac{2.5^4}{4!}\, e^{-2.5} = 0.134. $$

The reason for such an apparently large probability of obtaining a measurement that differs from the expected mean is simply due to the statistical nature of the detection process. ◇

Fig. 3.3 Poisson distribution with $\mu = 2$, corresponding to a binomial distribution with p = 0.2 and N = 10. The dotted lines represent the mean, the $\mu - \sigma$ and $\mu + \sigma$ points
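The value quoted in Example 3.5 follows directly from (3.18). A minimal sketch in Python, with a helper name `poisson_pmf` of our own choosing, is given below; it also lists the probabilities of the neighboring outcomes, to show that several values of n have comparable probability when the mean is $\mu = 2.5$.

```python
from math import exp, factorial

def poisson_pmf(n, mu):
    """Poisson probability of n counts for a mean mu, Eq. (3.18)."""
    return mu**n / factorial(n) * exp(-mu)

mu = 2.5
p4 = poisson_pmf(4, mu)
print(round(p4, 3))  # 0.134

for n in range(7):
    print(n, f"{poisson_pmf(n, mu):.3f}")
```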


3.3.3  The  Poisson  Distribution  and  the  Poisson  Process 

A  more  formal  justification  for  the  interpretation  of  the  Poisson  distribution  as  the 
distribution  of  counting  experiments  comes  from  the  Poisson  process.  Although 
a  complete  treatment  of  this  subject  is  beyond  the  scope  of  this  book,  a  short 
description  of  stochastic  processes  will  serve  to  strengthen  the  interpretation  of 
(3.18),  which  is  one  of  the  foundations  of  statistics.  More  details  on  stochastic 
processes  can  be  found,  for  example,  in  the  textbook  by  Ross  [38]. 

A stochastic counting process $\{N(t), t \ge 0\}$ is a sequence of random variables N(t), in which t indicates time, and N(t) is a random variable that indicates the number of events that occurred up to time t. The stochastic process can be thought of as repeating the experiment of “counting the occurrence of a given event” at various times t; N(t) is the result of the experiment. The Poisson process with rate $\lambda$ is a particular type of stochastic process, with the following properties:

1. N(0) = 0, meaning that at time 0 there are no counts detected.

2. The process has independent increments, meaning that N(t + s) - N(t) is independent of N(t); this is understood with the events occurring after time t not being influenced by those occurring prior to it.

3. The process has stationary increments, i.e., the distribution of the number of events in an interval of time s depends only on the length of the time interval itself.

4. $P(N(h) = 1) = \lambda h + o(h)$, in which o(h) is a function with the property that

$$ \lim_{h \to 0} \frac{o(h)}{h} = 0. $$

5. $P(N(h) \ge 2) = o(h)$. The latter two properties mean that the probability of obtaining one count depends on the finite value $\lambda$, while it is unlikely that two or more events occur in a short time interval.

It can be shown that under these hypotheses, the number of events N(t) recorded in any interval of length t is Poisson distributed,

$$ P\{N(t + s) - N(s) = n\} = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t}. \tag{3.19} $$

This shows that the Poisson distribution is to be interpreted as the distribution of occurrence of n events during a time interval t, under the hypothesis that the rate of occurrence of events is $\lambda$. This interpretation is identical to the one provided above, given that $\mu = \lambda t$ is the mean of the counts in that time interval.
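The properties listed above suggest a simple way to simulate a Poisson process: divide the interval of length t into many short sub-intervals of length h, and let each of them contain one event with probability $\lambda h$, independently of the others. The sketch below is only an approximate illustration of this idea, not a derivation of (3.19); the rate, the interval length, the number of sub-intervals, and the seed are arbitrary assumptions.

```python
import random
from statistics import mean, variance

random.seed(2)              # fixed seed, so that the sketch is reproducible
lam, t = 2.0, 1.5           # illustrative rate and length of the interval
steps, trials = 1000, 2000  # sub-intervals per trial, and number of trials
h = t / steps               # short sub-interval, with lam * h << 1

def count_events():
    """Number of events in (0, t), with at most one event per sub-interval."""
    return sum(1 for _ in range(steps) if random.random() < lam * h)

counts = [count_events() for _ in range(trials)]

# Sample mean and variance of the counts, to be compared with mu = lam * t
print(mean(counts), variance(counts), lam * t)
```

Both the sample mean and the sample variance of the simulated counts are close to $\lambda t$, as expected for a Poisson variable.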


3.3.4  An  Example  on  Likelihood  and  Posterior  Probability 
of  a  Poisson  Variable 

The  estimation  of  parameters  of  a  random  variable,  such  as  the  mean  of  the  Poisson 
distribution,  will  be  treated  in  full  detail  in  Chap.  5.  Here  we  present  a  simple 
application  that  consists  of  using  available  measurements  to  calculate  the  likelihood 
and to make inferences on the unknown value of the parent mean $\mu$ of a Poisson variable. The following examples illustrate how a single measurement n of a Poisson variable can be used to constrain the true mean $\mu$, and that care must be exercised in not confusing the likelihood of a measurement with the posterior probability. We assume for simplicity that the mean is an integer, although in general it may be any real number.

Within the Bayesian framework, a counting experiment can be written in terms of a dataset B, consisting of the measurement n of the variable, and events $A_i$, representing the fact that the parent mean is $\mu = i$. It follows that the likelihood can be written as

$$ P(B/A_i) = \frac{i^n}{n!}\, e^{-i}. $$

Example 3.6 (Calculation of Data Likelihood) A counting experiment results in a detection of n = 4 units, and one wants to make a statement as to what is the probability of such measurement. Using the Poisson distribution, the probability of detecting 4 counts if, for example, $\mu = 0$, 1, or 2, is given by the likelihood

$$ P(B/A_{012}) = \sum_{\mu=0}^{2} \frac{\mu^4}{4!}\, e^{-\mu} = 0 + \frac{1}{4!} \frac{1}{e} + \frac{2^4}{4!} \frac{1}{e^2} = 0.0153 + 0.0902 \simeq 0.106, $$

or 10.6 %; this is a likelihood of the data with models that assume a specific value for the mean. Notice that if the true value of the mean is zero, there is absolutely no probability of detecting any counts. One can thus conclude that there is slightly more than a 10 % chance of detecting 4 counts, given that the source truly emits 2 or fewer counts. This is not, however, a statement of possible values of the parent mean $\mu$. ◇

According to Bayes’ theorem, the posterior distributions are

$$ P(A_i/B) = \frac{P(B/A_i)\, P(A_i)}{P(B)} $$

where $P(B/A_i)$ is the likelihood, corresponding to each of the three terms in the sum of the example above. In the following example, we determine posterior probabilities.

Example 3.7 (Posterior Probability of the Poisson Mean) We want to calculate the probability of the true mean being less than or equal to 2, $P(A_{012}/B)$, and start by calculating the likelihoods required to evaluate P(B). We make an initial and somewhat arbitrary assumption that the mean should be $\mu \le 10$, so that only 11 likelihoods must be evaluated. This assumption is dictated simply by practical considerations, and can also be stated in terms of assuming a subjective prior knowledge that the mean is somehow known not to exceed 10. We calculate

$$ P(B) \simeq \sum_{i=0}^{10} \frac{i^4}{4!}\, e^{-i} \times P(A_i) = 0.979 \times P(A_i). $$

Also, assuming uniform priors, we have $P(A_i) = 1/11$ and that

$$ P(A_{012}/B) = \frac{1}{P(B)} \times \sum_{i=0}^{2} \frac{i^4}{4!}\, e^{-i}\, P(A_i) = \frac{1}{0.979} \sum_{i=0}^{2} \frac{i^4}{4!}\, e^{-i} = 0.108. $$

◇
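The numbers of Examples 3.6 and 3.7 can be reproduced with the following sketch, written in Python under the same assumptions as the examples (a measurement of n = 4 counts, integer means from 0 to 10, and uniform priors). The function and variable names are our own.

```python
from math import exp, factorial

def likelihood(n, i):
    """P(B/A_i): Poisson probability of measuring n counts for a mean i."""
    return i**n / factorial(n) * exp(-i)

n = 4
means = range(0, 11)      # assumed prior range of the mean, 0 to 10
prior = 1.0 / len(means)  # uniform prior, P(A_i) = 1/11

# Example 3.6: sum of the likelihoods for a mean of 0, 1, or 2
like_012 = sum(likelihood(n, i) for i in (0, 1, 2))

# Example 3.7: P(B), and the posterior probability that the mean is <= 2
evidence = sum(likelihood(n, i) * prior for i in means)
post_012 = sum(likelihood(n, i) * prior for i in (0, 1, 2)) / evidence

# Compare with 0.106, 0.979, and 0.108 in the text
print(round(like_012, 3), round(evidence / prior, 3), round(post_012, 3))
```

The uniform prior cancels in the ratio, so the posterior is simply the likelihood sum of Example 3.6 divided by 0.979; this is why the two numbers are close, although they answer different questions.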

The examples presented in this section illustrate the conceptual difference between the likelihood calculation and the estimate of the posterior, though the two calculations yielded similar numerical values.

3.4 Comparison of Binomial, Gaussian, and Poisson Distributions

In this section we provide numerical calculations that compare the binomial and Gaussian functions, and also discuss under what circumstances the Poisson distribution can be approximated by a Gaussian of same mean and variance. In fact practical computations with the Poisson distribution are often hampered by the need to calculate the factorial of large numbers. In Sect. 3.2 we derived the Gaussian distribution from the binomial function, using the approximation that $Np \gg 1$. In fact we assumed that the function has values $n \gg 1$ and, since the mean of the binomial is $\mu = Np$, the value Np sets the order of magnitude for the values of the random variable that have non-negligible probability. In the left panel of Fig. 3.4 we show the binomial distribution with parameters p = q = 0.5, showing that for Np = 5 the approximation is already at the level of 1 % near the peak of the distribution.

The main limitation of the Poisson distribution (3.18) is the presence of the factorial function, which very rapidly becomes a large number as a function of the integer n (for example, $20! = 2.433 \times 10^{18}$), and it may lead to overflow problems in numerical codes. For large values of n, one can use the Stirling approximation to the factorial function, which retains only the first term of the following expansion:

$$ n! \simeq \sqrt{2\pi n}\; n^n e^{-n} \left( 1 + \frac{1}{12 n} + \ldots \right). \tag{3.20} $$

Using this approximation for values of n > 10, the right panel of Fig. 3.4 shows two Poisson distributions with mean of, respectively, 3 and 20, and the corresponding Gaussian distributions with the same mean and of variance equal to the mean, as is the case for the Poisson distribution. The difference between the Gaussian and the Poisson distributions for a mean of $\mu = 20$ is at the percent level near the peak of the distribution. The Poisson distribution retains its characteristic asymmetry and a heavier tail at large values, and therefore deviations between the two functions are larger away from the mean where, however, the absolute value of the probability becomes negligible. It can also be shown that for the value of $x = \mu$, the two distributions have the same value, when the Stirling approximation is used for the factorial function. A rule of thumb used by many is that for x > 20 the Gaussian approximation to the Poisson distribution is acceptable.

Fig. 3.4 (Left) Binomial distributions with p = q = 0.5 and, respectively, N = 2 and N = 10 as points connected by dashed line. Matching Gaussian distributions with same mean $\mu = Np$ and variance $\sigma^2 = Npq$ (solid lines). (Right) Gaussian distribution with $\mu = \sigma^2 = 3$ and $\mu = \sigma^2 = 20$ (solid lines) and Poisson distributions with same mean as points connected by a dotted line (horizontal axes: random variable X)
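The points made in this section can be illustrated numerically. The sketch below, in Python, evaluates the first term of the Stirling expansion (3.20), and compares a Poisson distribution of mean 20 with the Gaussian of same mean and variance. As a design choice that is ours and not prescribed by the text, the Poisson probability is computed through logarithms with the standard `math.lgamma` function (for which lgamma(n + 1) = ln n!), one common way to avoid the overflow of the factorial.

```python
from math import exp, factorial, lgamma, log, pi, sqrt

def stirling(n):
    """First term of Eq. (3.20): n! ~ sqrt(2 pi n) n^n e^(-n)."""
    return sqrt(2.0 * pi * n) * n**n * exp(-n)

def poisson_pmf(n, mu):
    """Poisson probability, Eq. (3.18), computed via ln n! = lgamma(n + 1)."""
    return exp(n * log(mu) - mu - lgamma(n + 1))

def gaussian(x, mu, sigma2):
    """Gaussian of mean mu and variance sigma2, Eq. (3.11)."""
    return exp(-(x - mu)**2 / (2.0 * sigma2)) / sqrt(2.0 * pi * sigma2)

# Ratio of the Stirling approximation to the exact factorial for n = 20
print(stirling(20) / factorial(20))

mu = 20.0
for n in (10, 15, 20, 25, 30):
    print(n, f"{poisson_pmf(n, mu):.5f}", f"{gaussian(n, mu, mu):.5f}")
```

The ratio printed first differs from one by approximately 1/(12 n), the size of the first neglected term in (3.20).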

The approximation of a Poisson distribution with a Gaussian distribution is of great practical importance. Consider a counting experiment in which N counts are measured. The parent distribution of the random variable of interest is Poisson distributed and it is reasonable to assume that the best estimate of its mean is $\mu = N$ (but see Sect. 5.5.1 for a Bayesian approach that gives a slightly different answer). For values of N > 20 or so, the standard deviation of the parent Poisson distribution is therefore $\sigma = \sqrt{N}$. The measurement can be reported as $N \pm \sqrt{N}$, where the range of $N \pm \sqrt{N}$ corresponds to the $\mu \pm 1\sigma$ interval for a Gaussian variable.


Summary  of  Key  Concepts  for  this  Chapter 

□ Binomial distribution: It describes the probability of occurrence of n successes in N tries of a binary event,

$$ P(n) = \binom{N}{n} p^n q^{N-n} $$

(mean pN and variance pqN).

□ Gaussian distribution: It is an approximation of the binomial distribution when N is large,

$$ f(x)\, dx = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dx $$

(mean $\mu$ and variance $\sigma^2$).

□ Poisson distribution: It is an approximation of the binomial distribution when $p \ll 1$ that describes the probability of counting experiments,

$$ P(n) = \frac{\mu^n}{n!}\, e^{-\mu} $$

(mean and variance have a value of $\mu$).

Problems

3.1 Consider the Gaussian distribution

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}. $$

Calculate the mean and variance and show that all odd moments $E[(X - \mu)^n]$ of order $n \ge 3$ are zero.

3.2 Assume that scores from an I.Q. test follow a Gaussian distribution, and that the scores are standardized in such a way that the mean is $\mu = 100$, and the standard deviation is $\sigma = 15$.

(a) Calculate the probability that an I.Q. score is greater than or equal to 145.

(b) Calculate the probability that the mean I.Q. score of a sample of 100 persons, chosen at random, is equal to or larger than 105.

3.3  A  coin  is  tossed  ten  times.  Find 

(a)  The  probability  of  obtaining  5  heads  up  and  5  tails  up; 

(b)  The  probability  of  having  the  first  5  tosses  show  heads  up,  and  the  final  5  tosses 
show  tails  up; 

(c)  The  probability  to  have  at  least  7  heads  up. 

3.4  In  a  given  course,  it  is  known  that  7.3  %  of  students  fail. 

(a)  What  is  the  expected  number  of  failures  in  a  class  of  32  students? 

(b)  What  is  the  probability  that  5  or  more  students  fail? 

3.5 The frequency of twins in the European population is about 12 in every 1000 maternities. Calculate the probability that there are no twins in 200 births, using (a) the binomial distribution, and (b) the Poisson distribution.

3.6 Given the distribution of a Poisson variable N,

$$ P(n) = \frac{\mu^n}{n!}\, e^{-\mu}, $$

show that the mean is given by $\mu$ and that the variance is also given by $\mu$.

3.7 Consider Mendel’s experiment of Table 1.1 at page 9 and refer to the “Long vs. short stem” data.

(a)  Determine  the  parent  distribution  for  the  number  of  dominants. 

(b)  Calculate  the  uncertainty  in  the  measurement  of  the  number  of  plants  that 
display  the  dominant  character. 

(c)  Determine  the  difference  between  the  number  of  measured  plants  with  the 
dominant  character  and  the  expected  number,  in  units  of  the  standard  deviation, 
to  show  that  this  number  has  an  absolute  value  of  less  than  one. 




3.8 For Mendel’s experimental data in Table 1.1 on page 9, consider the overall
fraction of plants that display the dominant character, for all seven experiments
combined.

(a) Determine the parent distribution of the overall fraction $X$ of plants with
dominant character and its expected value.

(b) Determine the sample mean of the fraction $X$.

(c) Using the parent variance of $X$, determine the value

$$z = \frac{x - E[X]}{\sigma},$$

which is the standardized difference between the measurement and the mean.
Assuming that the binomial distribution can be approximated by a Gaussian of
the same mean and variance, calculate the probability of having a value of $z$ smaller
than or equal to (in absolute value) the measured value.


Chapter  4 

Functions  of  Random  Variables  and  Error 
Propagation 


Abstract Sometimes experiments do not directly measure the quantity of interest,
but rather associated variables that can be related to the one of interest by an analytic
function. It is therefore necessary to establish how we can infer properties of the
variable of interest based on properties of the variables that have been measured
directly. This chapter explains how to determine the probability distribution function
of a variable that is a function of other variables of known distribution, and how to
estimate its mean and variance, with the formulas for the latter usually referred to as
error propagation formulas. We also establish two fundamental results of the theory of
probability, the central limit theorem and the law of large numbers.

4.1  Linear  Combination  of  Random  Variables 

Experimental variables are often related by a simple linear relationship. The linear
combination of $N$ random variables $X_i$ is a variable $Y$ defined by

$$Y = \sum_{i=1}^{N} a_i X_i \tag{4.1}$$

where $a_i$ are constant coefficients. A typical example of a variable that is a linear
combination  of  two  variables  is  the  signal  detected  by  an  instrument,  which  can  be 
thought  of  as  the  sum  of  the  intrinsic  signal  from  the  source  plus  the  background. 
The  distributions  of  the  background  and  the  source  signals  will  influence  the 
properties  of  the  total  signal  detected,  and  it  is  therefore  important  to  understand 
the  statistical  properties  of  this  relationship  in  order  to  characterize  the  signal  from 
the  source. 

4.1.1  General  Mean  and  Variance  Formulas 

The expectation or mean of the linear combination is $E[Y] = \sum_{i=1}^{N} a_i E[X_i]$, or

$$\mu_y = \sum_{i=1}^{N} a_i \mu_i, \tag{4.2}$$

where $\mu_i$ is the mean of $X_i$. This property follows from the linearity of the
expectation operator, and it has the form of a weighted sum of the means in which
the weights are given by the coefficients $a_i$.

In the case of the variance, the situation is more complex:

$$
\begin{aligned}
\mathrm{Var}[Y] &= E\left[\left(\sum_{i=1}^{N} a_i X_i - \sum_{i=1}^{N} a_i \mu_i\right)^2\right] \\
&= \sum_{i=1}^{N} a_i^2 E\left[(X_i - \mu_i)^2\right] + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j E\left[(X_i - \mu_i)(X_j - \mu_j)\right] \\
&= \sum_{i=1}^{N} a_i^2 \mathrm{Var}(X_i) + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j \mathrm{Cov}(X_i, X_j).
\end{aligned}
$$

The result can be summarized in a more compact relationship,

$$\sigma_y^2 = \sum_{i=1}^{N} a_i^2 \sigma_i^2 + 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j \sigma_{ij}^2. \tag{4.3}$$

Equation (4.3) shows that variances add only for variables that are mutually
uncorrelated, or $\sigma_{ij}^2 = 0$, but not in general. The following example illustrates the
importance of a non-zero covariance between two variables, and its effect on the
variance of the sum.

Example 4.1 (Variance of Anti-correlated Variables) Consider the case of the
measurement of two random variables $X$ and $Y$ that are completely anti-correlated,
$\mathrm{Corr}(X, Y) = -1$, with mean and variance $\mu_x = 1$, $\mu_y = 1$, $\sigma_x^2 = 0.5$ and $\sigma_y^2 = 0.5$.

The mean of $Z = X + Y$ is $\mu = 1 + 1 = 2$. Since complete anti-correlation means
$\mathrm{Cov}(X, Y) = -\sigma_x \sigma_y$, the variance is $\sigma^2 = \sigma_x^2 + \sigma_y^2 + 2\,\mathrm{Cov}(X, Y) =
(\sigma_x - \sigma_y)^2 = 0$; this means that in this extreme case of complete
anti-correlation the sum of the two random variables is actually not a random variable
any more. If the covariance term had been neglected in (4.3), we would have made
the error of inferring a variance of 1 for the sum. ◇
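Equation (4.3) and the example above can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `linear_combination_variance`, the coefficients `a`, and the covariance matrix `cov` are hypothetical choices made only for illustration.

```python
import numpy as np

def linear_combination_variance(a, cov):
    """Variance of Y = sum_i a_i X_i, evaluated term by term as in Eq. (4.3)."""
    a = np.asarray(a, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = len(a)
    # Diagonal part: sum of a_i^2 * sigma_i^2
    var = sum(a[i] ** 2 * cov[i, i] for i in range(n))
    # Off-diagonal part: 2 * sum over i < j of a_i * a_j * sigma_ij^2
    var += 2 * sum(a[i] * a[j] * cov[i, j]
                   for i in range(n) for j in range(i + 1, n))
    return float(var)

# Hypothetical coefficients and covariance matrix (illustration only)
a = [1.0, 2.0, -1.0]
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 2.0, -0.5],
                [0.0, -0.5, 0.5]])

var_formula = linear_combination_variance(a, cov)
var_matrix = float(np.asarray(a) @ cov @ np.asarray(a))   # same quantity as a quadratic form
var_uncorrelated = linear_combination_variance(a, np.diag(np.diag(cov)))

# Example 4.1: X + Y with variances 0.5 and covariance -0.5 (correlation -1)
var_anticorrelated = linear_combination_variance([1.0, 1.0],
                                                 [[0.5, -0.5], [-0.5, 0.5]])
print(var_formula, var_matrix, var_uncorrelated, var_anticorrelated)
```

Dropping the off-diagonal terms changes the result for the three-variable case, and the anti-correlated pair of Example 4.1 returns a vanishing variance, as derived above.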


4.1.2 Uncorrelated Variables and the $1/\sqrt{N}$ Factor

For two or more uncorrelated variables the variances add linearly, according to (4.3).
Uncorrelated variables are common in statistics. For example, consider repeating the
same experiment a number $N$ of times independently, and each time a measurement
of a random variable $X_i$ is made. After $N$ experiments, one obtains $N$ measurements
from identically distributed random variables (since they resulted from the same
type of experiment). The variables are independent, and therefore uncorrelated, if
the experiments were performed in such a way that the outcome of one specific
experiment did not affect the outcome of another.

With $N$ uncorrelated variables $X_i$ all of equal mean $\mu$ and variance $\sigma^2$, one is
often interested in calculating the relative uncertainty in the variable

$$Y = \frac{1}{N}\sum_{i=1}^{N} X_i, \tag{4.4}$$

which describes the sample mean of $N$ measurements. The relative uncertainty is
described by the ratio of the standard deviation and the mean,

$$\frac{\sigma_y}{\mu_y} = \frac{1}{N}\,\frac{\sqrt{\sigma^2 + \cdots + \sigma^2}}{\mu} = \frac{1}{\sqrt{N}}\,\frac{\sigma}{\mu}, \tag{4.5}$$



where we used the property that $\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]$ and the fact that both means
and variances add linearly. The result shows that the $N$ measurements reduced the
relative error in the random variable by a factor of $1/\sqrt{N}$, as compared with a single
measurement. This observation is a key result in statistics, and it is the reason why
one needs to repeat the same experiment many times in order to reduce the relative
statistical error. Equation (4.5) can be recast to show that the variance in the sample
mean is given by

$$\sigma_y^2 = \frac{\sigma^2}{N}, \tag{4.6}$$

where $\sigma^2$ is the sample variance, or variance associated with one measurement. The
interpretation is simple: one expects much less variance between two measurements
of the sample mean, than between two individual measurements of the variable,
since the statistical fluctuations of individual measurements average down with
increasing sample size.
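The $1/\sqrt{N}$ reduction can be illustrated with a short simulation. This is only a sketch under stated assumptions: the Gaussian parent distribution, the values of `mu` and `sigma`, the seed, and the number of repeated experiments are hypothetical choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # fixed seed so the run is repeatable
mu, sigma = 10.0, 2.0                 # hypothetical parent mean and standard deviation
N = 100                               # measurements averaged in each experiment
n_experiments = 20_000                # number of repeated experiments

# Each row is one experiment made of N independent measurements
samples = rng.normal(mu, sigma, size=(n_experiments, N))
sample_means = samples.mean(axis=1)

rel_err_single = sigma / mu                                     # one measurement
rel_err_mean = sample_means.std(ddof=1) / sample_means.mean()   # average of N measurements
ratio = rel_err_single / rel_err_mean
print(ratio, np.sqrt(N))   # the two numbers should be close to each other
```

The ratio of the two relative errors should fluctuate around $\sqrt{N} = 10$, within the statistical noise of the simulation.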

Another important observation is that, in the case of completely correlated
variables, additional measurements introduce no advantage, i.e., the relative
error does not decrease with the number of measurements. This can be shown with
the aid of (4.3), and is illustrated in the following example.

Example 4.2 (Variance of Correlated Variables) Consider the two measurements in
Example 4.1, but now with a correlation of 1. In this case, the variance of the sum
is $\sigma^2 = \sigma_x^2 + \sigma_y^2 + 2\,\mathrm{Cov}(X, Y) = (\sigma_x + \sigma_y)^2$, and therefore the relative error in the
sum is

$$\frac{\sigma}{\mu} = \frac{\sigma_x + \sigma_y}{\mu_x + \mu_y},$$

which is the same as the relative error of each measurement. Notice that the same
conclusion applies to the average of the two measurements, since the sum and the
average differ only by a constant factor of 1/2. ◇


4.2  The  Moment  Generating  Function 


The mean and the variance provide only partial information on the random variable,
and a full description would require the knowledge of all moments. The moment
generating function is a convenient mathematical tool to determine the distribution
function of random variables and its moments. It is also useful to prove the central
limit theorem, one of the key results of statistics, since it establishes the Gaussian
distribution as the normal distribution when a random variable is the sum of a large
number of measurements.

The moment generating function of a random variable $X$ is defined as

$$M(t) = E[e^{tX}], \tag{4.7}$$

and it has the property that all moments can be derived from it, provided they exist
and are finite. Assuming a continuous random variable of probability distribution
function $f(x)$, the moment generating function can be written as

$$M(t) = \int_{-\infty}^{+\infty} e^{tx} f(x)\,dx = \int_{-\infty}^{+\infty} \left(1 + tx + \frac{(tx)^2}{2!} + \cdots\right) f(x)\,dx = 1 + t\mu_1 + \frac{t^2}{2!}\mu_2 + \cdots$$

where $\mu_r = E[X^r]$ is the moment of order $r$, and therefore all moments can be
obtained as partial derivatives,

$$\mu_r = \left.\frac{\partial^r M(t)}{\partial t^r}\right|_{t=0}. \tag{4.8}$$

The most important property of the moment generating function is that there
is a one-to-one correspondence between the moment generating function and the
probability distribution function, i.e., the moment generating function is a sufficient
description of the random variable. Some distributions do not have a moment
generating function, since some of their moments may be infinite, so in principle
this method cannot be used for all distributions.
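As an illustration of (4.8), the derivatives of a moment generating function at $t = 0$ can be approximated numerically. The sketch below uses a fair six-sided die, a hypothetical example that does not appear in the text, and central finite differences with an arbitrarily chosen step `h`.

```python
import math

def mgf_die(t):
    """M(t) = E[exp(tX)] for a fair six-sided die (outcomes 1..6, probability 1/6 each)."""
    return sum(math.exp(t * k) for k in range(1, 7)) / 6.0

h = 1e-4  # step of the central finite differences (an illustrative choice)
m1 = (mgf_die(h) - mgf_die(-h)) / (2 * h)                    # approximates dM/dt at t = 0
m2 = (mgf_die(h) - 2 * mgf_die(0.0) + mgf_die(-h)) / h**2    # approximates d^2M/dt^2 at t = 0

mean = m1                # first moment
variance = m2 - m1**2    # second moment minus the square of the first
print(mean, variance)    # compare with the exact values 7/2 and 35/12
```

The exact values follow from summing $k/6$ and $k^2/6$ over the six outcomes, so the finite-difference results can be compared directly with $7/2$ and $35/12$.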

4.2.1 Properties of the Moment Generating Function

A full treatment of mathematical properties of the moment generating function can
be found in textbooks on the theory of probability, such as [38]. Two properties of the
moment generating function will be useful in the determination of the distribution
function of random variables:

• If $Y = a + bX$, where $a$, $b$ are constants, the moment generating function of $Y$ is

$$M_y(t) = e^{at} M_x(bt). \tag{4.9}$$

Proof This relationship can be proved by the use of the expectation
operator, according to the definition of the moment generating function:

$$E[e^{tY}] = E[e^{t(a+bX)}] = E[e^{at} e^{btX}] = e^{at} M_x(bt).$$

□

• If $X$ and $Y$ are independent random variables, with $M_x(t)$ and $M_y(t)$ as moment
generating functions, then the moment generating function of $Z = X + Y$ is

$$M_z(t) = M_x(t) M_y(t). \tag{4.10}$$

Proof The relationship is derived immediately by using the fact that the expectation
of a product of independent variables is the product of the expectations:

$$E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX}]\,E[e^{tY}] = M_x(t) M_y(t).$$

□


4.2.2  The  Moment  Generating  Function  of  the  Gaussian 
and  Poisson  Distribution 

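Important cases to study are the Gaussian distribution of mean $\mu$ and variance $\sigma^2$
and the Poisson distribution of mean $\mu$.

• The moment generating function of the Gaussian is given by

$$M(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}. \tag{4.11}$$

Proof Start with

$$M(t) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} e^{tx} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx.$$

The exponent can be written as

$$tx - \frac{1}{2}\,\frac{x^2 + \mu^2 - 2x\mu}{\sigma^2} = \frac{2\sigma^2 t x - x^2 - \mu^2 + 2x\mu}{2\sigma^2} = -\frac{(x - \mu - \sigma^2 t)^2}{2\sigma^2} + \frac{2\mu\sigma^2 t}{2\sigma^2} + \frac{\sigma^4 t^2}{2\sigma^2}.$$

It follows that

$$
\begin{aligned}
M(t) &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} e^{\mu t}\, e^{\frac{\sigma^2 t^2}{2}}\, e^{-\frac{(x - \mu - \sigma^2 t)^2}{2\sigma^2}}\,dx \\
&= e^{\mu t}\, e^{\frac{\sigma^2 t^2}{2}}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, \sqrt{2\pi\sigma^2} = e^{\mu t + \frac{\sigma^2 t^2}{2}}.
\end{aligned}
$$

□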

• The moment generating function of the Poisson distribution is given by

$$M(t) = e^{-\mu} e^{\mu e^t}. \tag{4.12}$$

Proof The moment generating function is obtained by

$$M(t) = E[e^{tN}] = \sum_{n=0}^{\infty} e^{tn}\,\frac{\mu^n}{n!}\,e^{-\mu} = e^{-\mu} \sum_{n=0}^{\infty} \frac{(\mu e^t)^n}{n!} = e^{-\mu} e^{\mu e^t}.$$

□

Example 4.3 (Sum of Poisson Variables) The moment generating function can be
used to show that the sum of two independent Poisson random variables of mean $\lambda$
and $\mu$ is a Poisson random variable with mean $\lambda + \mu$. In fact, the mean of the Poisson
distribution appears in the exponent of the moment generating function (4.12), and
property (4.10) gives $M_z(t) = e^{-\lambda} e^{\lambda e^t}\, e^{-\mu} e^{\mu e^t} = e^{-(\lambda+\mu)} e^{(\lambda+\mu) e^t}$, which is
the moment generating function of a Poisson variable of mean $\lambda + \mu$. The fact that the
means of two independent Poisson distributions add is not surprising, given that the
Poisson distribution relates to the counting of discrete events. ◇
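The result of this example can be cross-checked numerically in two independent ways: by multiplying the moment generating functions, as in (4.10), and by directly convolving the two Poisson distributions. The sketch below is illustrative only; the means `lam` and `mu_p`, the cut-off `n_max`, and the helper names are hypothetical choices.

```python
import math
import numpy as np

def poisson_pmf(mean, n_max):
    """P(n) = mean^n exp(-mean) / n! for n = 0, ..., n_max."""
    return np.array([mean**n * math.exp(-mean) / math.factorial(n)
                     for n in range(n_max + 1)])

def poisson_mgf(mean, t):
    """Moment generating function of a Poisson variable, Eq. (4.12)."""
    return np.exp(-mean) * np.exp(mean * np.exp(t))

lam, mu_p = 2.0, 3.5   # hypothetical means of the two Poisson variables
n_max = 40

# Check 1: the product of the two MGFs equals the MGF of a Poisson of mean lam + mu_p
t = np.linspace(-1.0, 1.0, 5)
mgf_product = poisson_mgf(lam, t) * poisson_mgf(mu_p, t)
mgf_sum = poisson_mgf(lam + mu_p, t)

# Check 2: the distribution of the sum is the discrete convolution of the two
# distributions; entries 0..n_max are exact because P(Z = k) only involves n <= k
p_sum = np.convolve(poisson_pmf(lam, n_max), poisson_pmf(mu_p, n_max))[:n_max + 1]
p_expected = poisson_pmf(lam + mu_p, n_max)
print(np.max(np.abs(p_sum - p_expected)))
```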

4.3 The Central Limit Theorem

The Central Limit Theorem is one of the most important results of statistics, establishing
that a variable obtained as the sum of a large number of independent variables has a
Gaussian distribution. This result can be stated as:

Theorem 4.1 (Central Limit Theorem) The sum of a large number of independent
random variables is approximately distributed as a Gaussian. The mean of
the distribution is the sum of the means of the variables and the variance of the
distribution is the sum of the variances of the variables. This result holds regardless
of the distribution of each individual variable.

Proof Consider the variable $Y$ as the sum of $N$ variables $X_i$ of mean $\mu_i$ and
variance $\sigma_i^2$,

$$Y = \sum_{i=1}^{N} X_i, \tag{4.13}$$

with $M_i(t)$ the moment generating function of the random variable $(X_i - \mu_i)$.
Since the random variables are independent, and independence is a stronger
statement than uncorrelation, it follows that the mean of $Y$ is $\mu = \sum \mu_i$, and
that variances likewise add linearly, $\sigma^2 = \sum \sigma_i^2$. We want to calculate the
moment generating function of the variable $Z$ defined by

$$Z = \frac{Y - \mu}{\sigma} = \frac{1}{\sigma} \sum_{i=1}^{N} (X_i - \mu_i).$$

The variable $Z$ has a mean of zero and unit variance. We want to show that
$Z$ can be approximated by a standard Gaussian. Using the properties of the
moment generating function, the moment generating function of $Z$ is

$$M(t) = \prod_{i=1}^{N} M_i(t/\sigma).$$

The moment generating function of each variable $(X_i - \mu_i)/\sigma$ is

$$M_i(t/\sigma) = 1 + \mu_{(x_i - \mu_i)}\,\frac{t}{\sigma} + \frac{\sigma_i^2}{2}\left(\frac{t}{\sigma}\right)^2 + \frac{\mu_{i,3}}{3!}\left(\frac{t}{\sigma}\right)^3 + \ldots$$

where $\mu_{(x_i - \mu_i)} = 0$ is the mean of $X_i - \mu_i$. The quantities $\sigma_i^2$ and $\mu_{i,3}$ are,
respectively, the central moments of the second and third order of $X_i$.

If a large number of random variables are used, $N \gg 1$, then $\sigma^2$ is large, as
it is the sum of variances of the random variables, and we can ignore terms of
order $\sigma^{-3}$. We therefore make the approximation

$$\ln M(t) = \sum_{i=1}^{N} \ln M_i\!\left(\frac{t}{\sigma}\right) = \sum_{i=1}^{N} \ln\left(1 + \frac{\sigma_i^2}{2}\left(\frac{t}{\sigma}\right)^2\right) \simeq \sum_{i=1}^{N} \frac{\sigma_i^2}{2}\left(\frac{t}{\sigma}\right)^2 = \frac{t^2}{2}.$$

This results in the approximation of the moment generating function of
$(Y - \mu)/\sigma$ as

$$M(t) \simeq e^{\frac{t^2}{2}},$$

which shows that $Z$ is approximately distributed as a standard Gaussian
distribution, according to (4.11). Given that the random variable of interest
$Y$ is obtained by a change of variable $Z = (Y - \mu)/\sigma$, we also know that
$\mu_y = \mu$ and $\mathrm{Var}(Y) = \mathrm{Var}(\sigma Z) = \sigma^2\,\mathrm{Var}(Z) = \sigma^2$, therefore $Y$ is distributed
as a Gaussian with mean $\mu$ and variance $\sigma^2$. □

The  central  limit  theorem  establishes  that  the  Gaussian  distribution  is  the  limiting 
distribution  approached  by  the  sum  of  random  variables,  no  matter  their  original 
shapes,  when  the  number  of  variables  is  large.  A  particularly  illustrative  example 
is  the  one  presented  in  the  following,  in  which  we  perform  the  sum  of  a  number 
of  uniform  distributions.  Although  the  uniform  distribution  does  not  display  the 
Gaussian-like  feature  of  a  centrally  peaked  distribution,  with  the  increasing  number 
of  variables  being  summed,  the  sum  rapidly  approaches  a  Gaussian  distribution. 

Example 4.4 (Sum of Uniform Random Variables) We show that the sum of $N$
independent uniform random variables between 0 and 1 tends to a Gaussian with
mean $N/2$, given that each variable has a mean of 1/2. The calculation that the sum
of $N$ uniform distributions tends to the Gaussian can be done by first calculating the
moment generating function of the uniform distribution, then using the properties
of the moment generating function.

We can show that the uniform distribution in the range [0, 1] has $\mu_i = 1/2$, $\sigma_i^2 =
1/12$, and a moment generating function

$$M_i(t) = \frac{e^t - 1}{t};$$

the sum of $N$ independent such variables therefore has $\mu = N/2$ and $\sigma^2 = N/12$.
To prove that the sum is asymptotically distributed like a Gaussian with this mean
and variance, we must show that, for large $N$,

$$M(t) \simeq e^{\frac{N}{2} t + \frac{N}{12}\frac{t^2}{2}}.$$

Proof Using the property of the moment generating function of independent
variables, we write

$$M(t) = M_i(t)^N = \left(\frac{1}{t}(e^t - 1)\right)^N = \left(\frac{1 + t + t^2/2! + t^3/3! + \cdots - 1}{t}\right)^N \simeq \left(1 + \frac{t}{2} + \frac{t^2}{6} + \cdots\right)^N.$$

Neglect terms of order $O(t^3)$ and higher, and work with logarithms:

$$\ln(M(t)) \simeq N \ln\left(1 + \frac{t}{2} + \frac{t^2}{6}\right).$$

Use the Taylor series expansion $\ln(1 + x) \simeq x - x^2/2 + \ldots$, to obtain

$$\ln(M(t)) \simeq N\left(\frac{t}{2} + \frac{t^2}{6} - \frac{1}{2}\left(\frac{t}{2} + \frac{t^2}{6}\right)^2\right) = N\left(\frac{t}{2} + \frac{t^2}{6} - \frac{t^2}{8} + O(t^3)\right) \simeq N\left(\frac{t}{2} + \frac{t^2}{24}\right)$$

in which we continued neglecting terms of order $O(t^3)$. The equation above
shows that the moment generating function can be approximated as

$$M(t) \simeq e^{N\left(\frac{t}{2} + \frac{t^2}{24}\right)}, \tag{4.14}$$

which is in fact the moment generating function of a Gaussian with mean $N/2$
and variance $N/12$. □

In Figure 4.1 we show the simulations of, respectively, 1000 and 100,000 samples
drawn from $N = 100$ uniform and independent variables between 0 and 1. The
sample distributions approximate well the limiting Gaussian with $\mu = N/2$,
$\sigma = \sqrt{N/12}$. The approximation is improved when a larger number of samples
are drawn, also illustrating the fact that the sample distribution approximates the
parent distribution in the limit of a large number of samples collected. ◇
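A simulation of this kind can be sketched in a few lines. The code below is not the one used for the figure; the seed and the number of simulated sums (20,000, chosen to keep memory use modest) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for repeatability
N = 100                               # number of uniform variables in each sum
n_samples = 20_000                    # number of simulated sums (illustrative choice)

sums = rng.uniform(0.0, 1.0, size=(n_samples, N)).sum(axis=1)

mu, sigma = N / 2, np.sqrt(N / 12)    # parameters of the limiting Gaussian
z = (sums - mu) / sigma               # standardized sums

sample_mean = sums.mean()
sample_var = sums.var(ddof=1)
# Fraction of sums within one standard deviation of the mean;
# a Gaussian distribution would give about 0.683
frac_1sigma = float(np.mean(np.abs(z) < 1.0))
print(sample_mean, sample_var, frac_1sigma)
```

The sample mean and variance should be close to $N/2 = 50$ and $N/12 \simeq 8.33$, and the fraction of sums within one standard deviation should be close to the Gaussian value.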


Fig. 4.1 (histogram titled "Sum of 100 uniform variables") Sample distribution functions of the sum of $N = 100$ independent uniform variables
between 0 and 1, constructed from 1000 simulated measurements (grey histograms) and 100,000
measurements (histogram plot with black outline). The solid curve is the $N(\mu, \sigma)$ Gaussian, with
$\mu = N/2$, $\sigma = \sqrt{N/12}$, the limiting distribution according to the Central Limit Theorem

Example 4.5 (Sum of Two Uniform Distributions) An analytic way to develop a
practical sense of how the sum of non-Gaussian distributions progressively develops
the peaked Gaussian shape can be illustrated with the sum of just two uniform
distributions. We start with a uniform distribution in the range of $-1$ to 1, which
can be shown to have

$$M(t) = \frac{1}{2t}\left(e^{t} - e^{-t}\right).$$

The sum of two such variables will have a triangular distribution, given by the
analytical form

$$f(x) = \begin{cases} \dfrac{1}{2} + \dfrac{x}{4} & \text{if } -2 \leq x \leq 0 \\[2mm] \dfrac{1}{2} - \dfrac{x}{4} & \text{if } 0 \leq x \leq 2. \end{cases}$$

This is an intuitive result that can be proven by showing that the moment generating
function of the triangular distribution is equal to $M(t)^2$ (see Problem 4.3). The
calculation follows from the definition of the moment generating function for a
variable of known distribution function. The triangular distribution is the first step
in the development of a peaked, Gaussian-like distribution. ◇


4.4  The  Distribution  of  Functions  of  Random  Variables 

The general case of a variable that is a more complex function of other variables can
be studied analytically when certain conditions are met. In this book we present
the method of change of variables, which can be conveniently applied to
one-dimensional transformations, and a method based on the cumulative distribution
function, which can be used for multi-dimensional transformations. Additional
information on this subject can be found, e.g., in the textbook by Ross [38].

4.4.1 The Method of Change of Variables

A simple method for obtaining the probability distribution function of the dependent
variable $Y = Y(X)$ is the method of change of variables, which applies only
if the function $Y(x)$ is strictly increasing. In this case the probability distribution
$g(y)$ of the dependent variable is related to the distribution $f(x)$ of the independent
variable via

$$g(y) = f(x)\,\frac{dx}{dy}, \tag{4.15}$$

where $x = x(y)$ is the inverse of the function $Y(x)$. In the case of a decreasing
function, the same method can be applied but the term
$dx/dy$ must be replaced with the absolute value, $|dx/dy|$.

Example 4.6 Consider a variable $X$ distributed as a uniform distribution between 0
and 1, and the variable $Y = X^2$. Since $x = \sqrt{y}$ and $f(x) = 1$, the method automatically
provides the information that the variable $Y$ is distributed as

$$g(y) = \frac{1}{2}\,\frac{1}{\sqrt{y}}$$

with $0 \leq y \leq 1$. You can prove that the distribution is properly normalized in this
domain. ◇
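The result of this example can be checked by simulation. Integrating $g(y) = 1/(2\sqrt{y})$ from 0 to $y$ gives the cumulative distribution $\sqrt{y}$, which is compared below with the fraction of simulated values not exceeding a few test points. The seed, sample size, and test points are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)   # arbitrary seed, for repeatability
x = rng.uniform(0.0, 1.0, size=200_000)
y = x**2                              # the transformed variable Y = X^2

# g(y) = 1 / (2 sqrt(y)) integrates to the cumulative distribution sqrt(y)
checks = [0.04, 0.25, 0.64]
empirical_cdf = [float(np.mean(y <= c)) for c in checks]
expected_cdf = [float(np.sqrt(c)) for c in checks]
print(empirical_cdf, expected_cdf)
```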

The method can be naturally extended to the joint distribution of several random
variables. The multi-variable version of (4.15) is

$$g(u, v) = h(x, y)\,|J| \tag{4.16}$$

in which

$$J = \begin{pmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[2mm] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{pmatrix}$$

is the Jacobian of the transformation, in this case a 2 by 2 matrix, and $|J|$ is the absolute
value of its determinant; $h(x, y)$ is the
joint probability distribution of the independent variables $X$, $Y$, and $U$, $V$ are the
new random variables related to the original ones by a transformation $U = u(X, Y)$
and $V = v(X, Y)$.

Example 4.7 (Transformation of Cartesian to Polar Coordinates) Consider two
random variables $X$, $Y$ distributed as standard Gaussians, and independent of one
another. The joint probability distribution function is

$$h(x, y) = \frac{1}{2\pi}\, e^{-\frac{x^2 + y^2}{2}}.$$

Consider a transformation of variables from Cartesian coordinates $x$, $y$ to polar
coordinates $r$, $\theta$, described by

$$x = r\cos(\theta), \qquad y = r\sin(\theta).$$

The Jacobian of the transformation is

$$J = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$

and its determinant is $|J| = r$. Notice that to apply the method described by (4.16)
one only needs to know the inverse transformation of $(x, y)$ as function of $(r, \theta)$. It
follows that the distribution of $(r, \theta)$ is given by

$$g(r, \theta) = \frac{1}{2\pi}\, r\, e^{-\frac{r^2}{2}}$$

for $r \geq 0$, $0 \leq \theta \leq 2\pi$. The distribution $r e^{-r^2/2}$ is called the Rayleigh distribution,
and $1/2\pi$ can be interpreted as a uniform distribution for the angle $\theta$ between 0
and $2\pi$. One important conclusion is that, since $g(r, \theta)$ can be factored out into two
functions that contain separately the two variables $r$ and $\theta$, the two new variables
are also independent. ◇
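A quick simulation is consistent with these conclusions. The sketch below draws pairs of standard Gaussian variables and compares the sample mean of the radius with the known mean of the Rayleigh distribution, $\sqrt{\pi/2}$, and the sample mean of the angle with $\pi$. The seed and sample size are hypothetical choices, and a vanishing correlation coefficient is only a necessary condition for independence, not a proof of it.

```python
import numpy as np

rng = np.random.default_rng(seed=3)   # arbitrary seed, for repeatability
n = 200_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

r = np.hypot(x, y)                              # radius
theta = np.mod(np.arctan2(y, x), 2 * np.pi)     # angle mapped to [0, 2*pi)

mean_r = float(r.mean())              # expected: sqrt(pi/2), the Rayleigh mean
mean_theta = float(theta.mean())      # expected: pi, the mean of a uniform angle
corr = float(np.corrcoef(r, theta)[0, 1])   # expected: close to zero
print(mean_r, np.sqrt(np.pi / 2), mean_theta, corr)
```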


4.4.2  A  Method  for  Multi-dimensional  Functions 


We will consider the case in which the variable $Z$ is a function of two random
variables $X$ and $Y$, since this is a case of common use in statistics, e.g., $X + Y$,
or $X/Y$. We illustrate the methodology with the case of the function $Z = X + Y$,
when the two variables are independent. The calculation starts with the cumulative
distribution function of the random variable of interest,

$$F_Z(a) = P(Z \leq a) = \iint_{x + y \leq a} f(x)\,g(y)\,dx\,dy$$

in which $f(x)$ and $g(y)$ are, respectively, the probability distribution functions of
$X$ and $Y$, and the limits of integration must be chosen so that the sum of the two
variables is less than or equal to $a$. The portion of parameter space such that $x + y \leq a$
includes all values $x \leq a - y$, for any given value of $y$, or

$$F_Z(a) = \int_{-\infty}^{+\infty} dy \int_{-\infty}^{a - y} f(x)\,g(y)\,dx = \int_{-\infty}^{+\infty} g(y)\,dy\,F_X(a - y)$$

where $F_X$ is the cumulative distribution for the variable $X$. It is often more convenient
to express the relationship in terms of the probability distribution function, which is
related to the cumulative distribution function via a derivative,

$$f_Z(a) = \frac{d}{da} F_Z(a) = \int_{-\infty}^{+\infty} f(a - y)\,g(y)\,dy. \tag{4.17}$$

This relationship is called the convolution of the distributions $f(x)$ and $g(y)$.

Example 4.8 (Sum of Two Independent Uniform Variables) Calculate the probability
distribution function of the sum of two independent uniform random variables
between $-1$ and $+1$.

The probability distribution function of a uniform variable between $-1$ and $+1$
is $f(x) = 1/2$, defined for $-1 \leq x \leq 1$. The convolution gives the following integral

$$f_Z(a) = \int_{-1}^{+1} \frac{1}{2}\, f(a - y)\,dy.$$

The distribution function of the sum $Z$ can have values $-2 \leq a \leq 2$, and the
convolution must be divided into two integrals, since $f(a - y)$ is only defined between
$-1$ and $+1$. We obtain

$$f_Z(a) = \begin{cases} \dfrac{1}{4}\displaystyle\int_{-1}^{a+1} dy & \text{if } -2 \leq a \leq 0 \\[3mm] \dfrac{1}{4}\displaystyle\int_{a-1}^{1} dy & \text{if } 0 \leq a \leq 2. \end{cases}$$

This results in

$$f_Z(a) = \begin{cases} \dfrac{1}{4}(a + 2) & \text{if } -2 \leq a \leq 0 \\[2mm] \dfrac{1}{4}(2 - a) & \text{if } 0 \leq a \leq 2 \end{cases}$$

which is the expected triangular distribution between $-2$ and $+2$. ◇
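The convolution integral (4.17) can also be evaluated numerically on a grid. The following sketch discretizes the uniform density on midpoints of equal cells and uses `numpy.convolve`; the grid size `n` is an arbitrary choice, and the discrete sum is only an approximation of the integral in general (for this piecewise-constant density it happens to reproduce the triangle at the grid points).

```python
import numpy as np

n = 2000                                   # number of grid cells (illustrative choice)
dx = 2.0 / n
x = -1.0 + dx * (np.arange(n) + 0.5)       # midpoints of the cells covering [-1, 1]
f = np.full_like(x, 0.5)                   # uniform density f(x) = 1/2

# Discrete version of Eq. (4.17): f_Z(a) = integral of f(a - y) g(y) dy, with g = f
fz = np.convolve(f, f) * dx
a = -2.0 + dx * (np.arange(fz.size) + 1)   # value of the sum for each entry of fz

triangular = (2.0 - np.abs(a)) / 4.0       # analytic result of the example
max_diff = float(np.max(np.abs(fz - triangular)))
normalization = float(np.sum(fz) * dx)     # should be 1 for a probability density
print(max_diff, normalization)
```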

Another useful application is for the case of $Z = X/Y$, where $X$ and $Y$ are
again independent variables. We begin with the cumulative distribution,

$$F_Z(z) = P(Z \leq z) = P(X/Y \leq z) = P(X \leq zY).$$

For a given value $y$ of the random variable $Y$, this probability equals $F_X(zy)$;
since $Y$ has a probability $f_Y(y)\,dy$ to be in the range between $y$ and $y + dy$, we
obtain

$$F_Z(z) = \int_{-\infty}^{+\infty} F_X(zy)\,f_Y(y)\,dy.$$

Following the same method as for the derivation of the distribution of $X + Y$,
we must take the derivative of $F_Z(z)$ with respect to $z$ to obtain:

$$f_Z(z) = \int_{-\infty}^{+\infty} f_X(zy)\,y\,f_Y(y)\,dy. \tag{4.18}$$

This is the integral that must be solved to obtain the distribution of $X/Y$. Strictly,
the step $P(X/Y \leq z) = P(X \leq zY)$ assumes $Y > 0$; for a variable $Y$ that can also take
negative values, the factor $y$ in (4.18) is replaced by $|y|$.
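As a worked illustration of (4.18), consider the ratio of two independent standard Gaussian variables, an example that is not in the original text. Because a Gaussian variable takes both signs, the sketch below uses the factor $|y|$ in the integrand; the integral can be done analytically and gives $1/(\pi(1+z^2))$, the Cauchy distribution. The integration range `y_max` and grid size `n` are arbitrary choices for a simple Riemann sum.

```python
import numpy as np

def gaussian_pdf(x):
    """Standard Gaussian probability density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def ratio_pdf(z, y_max=10.0, n=200_001):
    """f_Z(z) = integral of f_X(z y) |y| f_Y(y) dy, approximated by a Riemann sum."""
    y, dy = np.linspace(-y_max, y_max, n, retstep=True)
    integrand = gaussian_pdf(z * y) * np.abs(y) * gaussian_pdf(y)
    return float(np.sum(integrand) * dy)

z_values = [0.0, 0.5, 2.0]
numeric = [ratio_pdf(z) for z in z_values]
cauchy = [1.0 / (np.pi * (1.0 + z**2)) for z in z_values]   # analytic result
print(numeric, cauchy)
```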


4.5  The  Law  of  Large  Numbers 

Consider $N$ random variables $X_i$ that are identically distributed, with $\mu$ their
common mean. The Strong Law of Large Numbers states that, under suitable
conditions on the variance of the random variables, the average of the $N$ variables tends
to the mean $\mu$, which is a deterministic number and not a random variable. This
result can be stated as

$$\lim_{N \to \infty} \frac{X_1 + \ldots + X_N}{N} = \mu, \tag{4.19}$$

and it is, together with the Central Limit Theorem, one of the most important results
of the theory of probability, and of great importance for statistics. Equation (4.19)
is a very strong statement because it shows that, asymptotically, the sample mean of
the $N$ variables, or $N$ measurements, becomes a constant equal to the parent mean $\mu$.
Although no indication is given towards establishing how large $N$
should be in order to achieve this goal, it is nonetheless an important result that will
be used in determining the asymptotic behavior of random variables. Additional
mathematical properties of this law can be found in books on the theory of probability,
such as [38] or [26].

Instead of providing a formal proof of this law, we want to focus on an important
consequence. Given a function $y(x)$, we would like to estimate its expected value
$E[y(X)]$ from the $N$ measurements of the variables $X_i$. According to the law of large
numbers, we can say that

$$\lim_{N \to \infty} \frac{y(X_1) + \cdots + y(X_N)}{N} = E[y(X)]. \tag{4.20}$$

Equation (4.20) states that a large number of measurements of the variables $X_i$ can
be used to estimate the expectation $E[y(X)]$, entirely bypassing the probability
distribution function of the function $y(X)$. This property is used in the following
section.


4.6 The Mean of Functions of Random Variables

For a function of random variables it is often necessary or convenient to develop
methods to estimate the mean and the variance without having full knowledge of its
probability distribution function.

For functions of a single variable $Y = y(X)$,

$$E[y(X)] = \int y(x) f(x)\,dx \tag{4.21}$$

where $f(x)$ is the distribution function of $X$. This is in fact a very intuitive result,
stating that the function of interest is averaged using the distribution function of $X$
as a weight, and
it makes it straightforward to compute expectation values of variables without first
having to calculate their full distribution. According to the law of large numbers,
this expectation can be estimated from $N$ measurements $x_i$ as per (4.20),

$$\overline{y(x)} = \frac{y(x_1) + \ldots + y(x_N)}{N}. \tag{4.22}$$

An important point is that, in general, the mean of the function is not equal to the
function of the mean, $\overline{y(x)} \neq y(\bar{x})$, as will be illustrated in the following example.
Equation (4.22) says that we must have access to the individual measurements of the
variable $X$, if we want to make inferences on the mean of a function of $X$. If, for
example, we only had the mean $\bar{x}$, we could not measure $\overline{y(x)}$. This point is relevant
when one has limited access to the data, e.g., when the experimenter does not report
all information on the measurements performed.

Example 4.9 (Mean of Square of a Uniform Variable) Consider the case of a
uniform variable $U$ in the range 0-1, with mean 1/2. If we want to evaluate the
parent mean of $X = U^2$, we calculate

$$\mu = \int_0^1 u^2\,du = 1/3.$$

It is important to see that the mean of $U^2$ is not just the square of the mean of $U$,
and therefore the means do not transform following the same analytic expression as
the random variables. You can convince yourself of this fact by assuming to draw
five “fair” samples from a uniform distribution, 0.1, 0.3, 0.5, 0.7 and 0.9; they can
be considered as a dataset of measurements. Clearly their mean is 1/2, but the mean
of their squares is 0.33, close to 1/3 and not 1/4, in agreement with the theoretical
calculation of the parent mean. ◇
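The arithmetic of this example, and the estimate (4.22) on a larger dataset, can be reproduced as follows. This is a minimal sketch; the simulated dataset, its size, and the seed are assumptions added for illustration.

```python
import numpy as np

u = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # the five "fair" samples of the example

mean_u = float(u.mean())                  # sample mean of U
mean_of_square = float(np.mean(u**2))     # sample mean of U^2, as in Eq. (4.22)
square_of_mean = mean_u**2                # function of the mean

# A larger simulated dataset (arbitrary seed) approaches the parent mean 1/3
rng = np.random.default_rng(seed=4)
mc_estimate = float(np.mean(rng.uniform(0.0, 1.0, size=500_000)**2))
print(mean_u, mean_of_square, square_of_mean, mc_estimate)
```

The five samples give a mean of squares of 0.33 and a squared mean of 0.25, while the simulated estimate approaches 1/3 as the number of samples grows, as expected from (4.20).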

Another example where the mean of the function does not equal the function
of the mean is reported in Problem 4.5, in which you can show that using the means
of $I$ and $W/Q$ does not give the mean of $m/e$ for the Thomson experiment to measure
the mass to charge ratio of the electron. The problem provides a multi-dimensional
extension to (4.22), since the variable $m/e$ is a function of two variables that have
been measured in pairs.


4.7  The  Variance  of  Functions  of  Random  Variables 
and  Error  Propagation  Formulas 


A random variable Z that is a function of other variables can have its variance estimated directly if the measurements of the independent variables are available, similar to the case of the estimation of the mean. Consider, for example, the case of a function Z = z(U) that depends on just one variable, for which we have N measurements $u_1, \ldots, u_N$ available. With the mean $\bar{z}$ estimated from (4.22), the variance can accordingly be estimated as

$$ s_z^2 = \frac{(z(u_1) - \bar{z})^2 + \ldots + (z(u_N) - \bar{z})^2}{N-1}, \tag{4.23} $$

as one would normally do, treating the numbers $z(u_1), \ldots, z(u_N)$ as samples from the dependent variable. This method can naturally be extended to more than one variable, as illustrated in the following example. When the measurements of the independent variables are available, this method is the straightforward way to estimate the variance of the function of random variables.
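As an illustration of (4.23), the following sketch evaluates a function on each measurement and then takes the ordinary sample mean and sample variance of the results. The function z(u) = u^2 and the five measurements are hypothetical choices made only for this example.

```python
import statistics

def z(u):
    """Illustrative function of the measured variable (an assumption)."""
    return u**2

# Hypothetical measurements of the independent variable U.
u_measurements = [1.2, 0.9, 1.1, 1.0, 0.8]

# Treat z(u_1), ..., z(u_N) as samples of the dependent variable Z.
z_values = [z(u) for u in u_measurements]
z_mean = statistics.mean(z_values)
s2_z = statistics.variance(z_values)   # sample variance, divides by N - 1

print(f"mean of Z = {z_mean:.4f}, variance of Z = {s2_z:.5f}")
```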

Example 4.10 Using the Thomson experiment described on page 23, consider the data collected for Tube 1, consisting of 11 measurements of W/Q and I, from which the variable of interest v is calculated as

$$ v = \frac{2\,W/Q}{I}. $$

From the reported data, one obtains 11 measurements of v, from which the mean and standard deviation can be immediately calculated as $\bar{v} = 7.9 \times 10^9$ and $s_v = 2.8 \times 10^9$. ◇

There are a number of instances in which one does not have access to the original measurements of the independent variable or variables, required for an accurate estimate of the variance according to (4.23). In this case, an approximate method to estimate the variance must be used instead. This method takes the name of error propagation. Consider a random variable Z that is a function of a number of variables, Z = z(U, V, ...). A method to approximate the variance of Z in terms of the variance of the independent variables U, V, etc. starts by expanding Z in a Taylor series about the means of the independent variables, to obtain


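$$ z(u, v, \ldots) = z(\mu_u, \mu_v, \ldots) + (u - \mu_u) \left.\frac{\partial z}{\partial u}\right|_{\mu_u} + (v - \mu_v) \left.\frac{\partial z}{\partial v}\right|_{\mu_v} + \ldots + O(u - \mu_u)^2 + O(v - \mu_v)^2 + \ldots $$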


Neglecting terms of the second order, the expectation of Z would be given by $E[Z] = z(\mu_u, \mu_v, \ldots)$, i.e., the mean of Z would be approximated as $\mu_z = z(\mu_u, \mu_v, \ldots)$. This is exact only if the function is linear, and we have shown in Sect. 4.6 that this approximation may not be sufficiently accurate in the case of nonlinear functions such as $U^2$. This approximation for the mean is used to estimate the variance of Z, for which we retain only terms of first order in the Taylor expansion:


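$$ E[(Z - E[Z])^2] \simeq E\left[ \left( (u - \mu_u) \left.\frac{\partial z}{\partial u}\right|_{\mu_u} + (v - \mu_v) \left.\frac{\partial z}{\partial v}\right|_{\mu_v} + \ldots \right)^2 \right] $$

$$ \simeq E\left[ (u - \mu_u)^2 \left( \left.\frac{\partial z}{\partial u}\right|_{\mu_u} \right)^2 + (v - \mu_v)^2 \left( \left.\frac{\partial z}{\partial v}\right|_{\mu_v} \right)^2 + 2 (u - \mu_u) \left.\frac{\partial z}{\partial u}\right|_{\mu_u} (v - \mu_v) \left.\frac{\partial z}{\partial v}\right|_{\mu_v} + \ldots \right]. $$
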

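This formula can be rewritten as

$$ \sigma_z^2 \simeq \sigma_u^2 \left( \left.\frac{\partial z}{\partial u}\right|_{\mu_u} \right)^2 + \sigma_v^2 \left( \left.\frac{\partial z}{\partial v}\right|_{\mu_v} \right)^2 + 2 \sigma_{uv}^2 \left.\frac{\partial z}{\partial u}\right|_{\mu_u} \left.\frac{\partial z}{\partial v}\right|_{\mu_v} + \ldots, \tag{4.24} $$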


which is usually referred to as the error propagation formula, and can be used for any number of independent variables. This result makes it possible to estimate the variance of a function of variables, knowing simply the variance of each of the independent variables and their covariances. The formula is especially useful for all cases in which the measured variables are independent, and all that is known is their mean and standard deviation (but not the individual measurements used to determine the mean and variance). This method must be considered as an approximation when there is only incomplete information about the measurements. Neglecting terms of the second order in the Taylor expansion can in fact lead to large errors, especially when the function has strong nonlinearities. In the following we provide a few specific formulas for functions that are of common use.


4.7.1 Sum of a Constant

Consider the case in which a constant a is added to the variable U,

$$ Z = U + a, $$

where a is a deterministic constant which can have either sign. It is clear that $\partial z/\partial a = 0$ and $\partial z/\partial u = 1$, and therefore the addition of a constant has no effect on the uncertainty of Z,

$$ \sigma_z^2 = \sigma_u^2. \tag{4.25} $$

The addition or subtraction of a constant only changes the mean of the variable by the same amount, but leaves its standard deviation unchanged.


4.7.2 Weighted Sum of Two Variables

The variance of the weighted sum of two variables,

$$ Z = aU + bV, $$

where a, b are constants of either sign, can be calculated using $\partial z/\partial u = a$ and $\partial z/\partial v = b$. We obtain

$$ \sigma_z^2 = a^2 \sigma_u^2 + b^2 \sigma_v^2 + 2ab\, \sigma_{uv}^2. \tag{4.26} $$

The special case in which the two variables U, V are uncorrelated leads to the weighted sum of the variances.

Example 4.11 Consider a decaying radioactive source which is found to emit $N_1 = 50$ counts and $N_2 = 35$ counts in two time intervals of the same duration, during which B = 20 background counts are recorded. This is an idealized situation in which we have directly available the measurement of the background counts. In the majority of real-life experiments one simply measures the sum of signal plus background, and in those cases additional considerations must be used. We want to calculate the background-subtracted source counts in the two time intervals and estimate their signal-to-noise ratio, defined as $S/N = \mu/\sigma$. The inverse of the signal-to-noise ratio is the relative error of the variable.

Each random variable $N_1$, $N_2$, and B obeys the Poisson distribution, since it comes from a counting process. Therefore, we can estimate the following parent means and variances from the sample measurements,

$$ \mu_1 = 50, \quad \sigma_1 = \sqrt{50} = 7.1 $$
$$ \mu_2 = 35, \quad \sigma_2 = \sqrt{35} = 5.9 $$
$$ \mu_B = 20, \quad \sigma_B = \sqrt{20} = 4.5 $$

Since the source counts are given by $S_1 = N_1 - B$ and $S_2 = N_2 - B$, we can now use the approximate variance formulas assuming that the variables are uncorrelated, $\sigma_{S_1} = \sqrt{50 + 20} = 8.4$ and $\sigma_{S_2} = \sqrt{35 + 20} = 7.4$. The two measurements of the source counts would be reported as $S_1 = 30 \pm 8.4$ and $S_2 = 15 \pm 7.4$, from which the signal-to-noise ratios are given, respectively, as $\mu_{S_1}/\sigma_{S_1} = 3.6$ and $\mu_{S_2}/\sigma_{S_2} = 2.0$. ◇
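The numbers of this example can be reproduced with a short script. It is a sketch under the same assumptions as the example (Poisson counts, so that the variance equals the mean, and uncorrelated variables); the function name is hypothetical.

```python
import math

n1, n2, b = 50, 35, 20   # total counts in the two intervals, and background

def background_subtracted(n_total, n_bkg):
    """Source counts S = N - B and their standard deviation.

    Assumes Poisson counts (variance = mean) and uncorrelated N and B,
    so that the variances add.
    """
    s = n_total - n_bkg
    sigma_s = math.sqrt(n_total + n_bkg)
    return s, sigma_s

s1, sigma_s1 = background_subtracted(n1, b)
s2, sigma_s2 = background_subtracted(n2, b)

for label, s, sigma in (("S1", s1, sigma_s1), ("S2", s2, sigma_s2)):
    print(f"{label} = {s} +/- {sigma:.1f}, S/N = {s / sigma:.1f}")
```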


4.7.3 Product and Division of Two Random Variables

Consider the product of two random variables U, V, optionally also with a constant factor a of either sign,

$$ Z = aUV. \tag{4.27} $$

The partial derivatives are $\partial z/\partial u = av$ and $\partial z/\partial v = au$, leading to the approximate variance of

$$ \sigma_z^2 = a^2 v^2 \sigma_u^2 + a^2 u^2 \sigma_v^2 + 2 a^2 u v\, \sigma_{uv}^2. $$

This can be rewritten as

$$ \frac{\sigma_z^2}{z^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} + 2 \frac{\sigma_{uv}^2}{uv}. \tag{4.28} $$

Similarly, the division between two random variables,

$$ Z = a \frac{U}{V}, \tag{4.29} $$

leads to

$$ \frac{\sigma_z^2}{z^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} - 2 \frac{\sigma_{uv}^2}{uv}. \tag{4.30} $$

Notice that the equations for product and division differ by just one sign, meaning that a positive covariance between the variables leads to a reduction in the standard deviation for the division, and an increase in the standard deviation for the product.

Example 4.12 Using the Thomson experiment of page 23, consider the data for Tube 1, and assume that the only numbers available are the mean and standard deviation of W/Q and I. From these numbers we want to estimate the mean and variance of v. The measurements of the two variables are $W/Q = (13.3 \pm 8.5) \times 10^{11}$ and $I = 312.9 \pm 93.4$, from which the mean of v would have to be estimated as $\bar{v} = 8.5 \times 10^9$; compare with the value of $7.9 \times 10^9$ obtained from the individual measurements.

The estimate of the variance also requires knowledge of the covariance between the two variables W/Q and I. In the absence of any information, we will assume that the two variables are uncorrelated, and use the error propagation formula to obtain

$$ \sigma_v \simeq 2 \times \frac{13.3 \times 10^{11}}{312.9} \times \sqrt{\left(\frac{8.5}{13.3}\right)^2 + \left(\frac{93.4}{312.9}\right)^2} = 6 \times 10^9, $$

which is a factor of 2 larger than estimated directly from the data (see Example 4.10). Part of the discrepancy is to be attributed to the neglect of the covariance between the measurements, which can be found to be positive, and therefore would reduce the variance of v according to (4.30). Using this approximate method, we would estimate the measurement as $v = (8.5 \pm 6) \times 10^9$, instead of $(7.9 \pm 2.8) \times 10^9$. ◇
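The approximate estimate of this example follows from the division formula (4.30) with the covariance set to zero. A minimal sketch of the calculation, using only the four summary numbers quoted in the example and the relationship v = 2(W/Q)/I; the variable names are illustrative.

```python
import math

# Summary statistics quoted in Example 4.12 for Tube 1.
wq, sigma_wq = 13.3e11, 8.5e11   # mean and standard deviation of W/Q
i, sigma_i = 312.9, 93.4         # mean and standard deviation of I

# Mean of v approximated by the function evaluated at the means.
v = 2.0 * wq / i

# Relative errors add in quadrature for a ratio of uncorrelated variables.
relative_error = math.sqrt((sigma_wq / wq) ** 2 + (sigma_i / i) ** 2)
sigma_v = v * relative_error

print(f"v = {v:.2e} +/- {sigma_v:.1e}")
```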


4.7.4  Power  of  a  Random  Variable 

A random variable may be raised to a constant power, and optionally multiplied by a constant,

$$ Z = aU^b, \tag{4.31} $$

where a and b are constants of either sign. In this case, $\partial z/\partial u = abu^{b-1}$ and the error propagation results in

$$ \frac{\sigma_z}{z} = |b| \frac{\sigma_u}{u}. \tag{4.32} $$

This result states that the relative error in Z is |b| times the relative error in U.


4.7.5 Exponential of a Random Variable

Consider the function

$$ Z = ae^{bU}, \tag{4.33} $$

where a and b are constants of either sign. The partial derivative is $\partial z/\partial u = abe^{bu}$, and we obtain

$$ \frac{\sigma_z}{z} = |b| \sigma_u. \tag{4.34} $$


4.7.6  Logarithm  of  a  Random  Variable 

For the function

$$ Z = a\ln(bU), \tag{4.35} $$

where a is a constant of either sign and b > 0, the partial derivative is $\partial z/\partial u = a/u$, leading to

$$ \sigma_z = |a| \frac{\sigma_u}{u}. \tag{4.36} $$

A similar result applies for a base-10 logarithm,

$$ Z = a\log(bU), \tag{4.37} $$

where a is a constant of either sign and b > 0. The partial derivative is $\partial z/\partial u = a/(u\ln(10))$, leading to

$$ \sigma_z = |a| \frac{\sigma_u}{u\ln(10)}. \tag{4.38} $$


Similar  error  propagation  formulas  can  be  obtained  for  virtually  any  analytic 
function  for  which  derivatives  can  be  calculated.  Some  common  formulas  are 
reported  for  convenience  in  Table  4.1,  where  the  terms  z,  u,  and  v  refer  to  the  random 
variables  evaluated  at  their  estimated  mean  value. 
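When a closed-form derivative is inconvenient, the first-order formula (4.24) can also be evaluated numerically. The sketch below is one possible approach and not a method given in the text: the function name, the finite-difference step `h`, and the test values are all assumptions made for illustration.

```python
def propagate_variance(z, mu_u, mu_v, var_u, var_v, cov_uv=0.0, h=1e-6):
    """First-order estimate of the variance of z(U, V), following (4.24).

    The partial derivatives are evaluated at the means with central finite
    differences; h is a relative step size chosen here for illustration.
    """
    du = h * max(abs(mu_u), 1.0)
    dv = h * max(abs(mu_v), 1.0)
    dz_du = (z(mu_u + du, mu_v) - z(mu_u - du, mu_v)) / (2.0 * du)
    dz_dv = (z(mu_u, mu_v + dv) - z(mu_u, mu_v - dv)) / (2.0 * dv)
    return var_u * dz_du**2 + var_v * dz_dv**2 + 2.0 * cov_uv * dz_du * dz_dv

# Weighted sum Z = 2U + 3V with a covariance term, to compare with (4.26).
var_sum = propagate_variance(lambda u, v: 2 * u + 3 * v,
                             1.0, 1.0, 1.0, 4.0, cov_uv=0.5)

# Ratio Z = U/V of uncorrelated variables, to compare with (4.30).
mu_u, mu_v, var_u, var_v = 10.0, 5.0, 0.04, 0.01
var_ratio = propagate_variance(lambda u, v: u / v, mu_u, mu_v, var_u, var_v)
expected_ratio = (mu_u / mu_v) ** 2 * (var_u / mu_u**2 + var_v / mu_v**2)

print(f"weighted sum: {var_sum:.6f}")
print(f"ratio       : {var_ratio:.6f} (closed form {expected_ratio:.6f})")
```

Like the analytic formulas, this remains a first-order approximation and inherits the same limitations for strongly nonlinear functions.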

Table 4.1 Common error propagation formulas

Function            Error propagation formula                                                                                   Notes
$Z = U + a$         $\sigma_z^2 = \sigma_u^2$                                                                                    a is a constant
$Z = aU + bV$       $\sigma_z^2 = a^2\sigma_u^2 + b^2\sigma_v^2 + 2ab\,\sigma_{uv}^2$                                            a, b are constants
$Z = aUV$           $\frac{\sigma_z^2}{z^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} + 2\frac{\sigma_{uv}^2}{uv}$       a is a constant
$Z = a\,U/V$        $\frac{\sigma_z^2}{z^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} - 2\frac{\sigma_{uv}^2}{uv}$       a is a constant
$Z = aU^b$          $\frac{\sigma_z}{z} = |b|\frac{\sigma_u}{u}$                                                                 a, b are constants
$Z = ae^{bU}$       $\frac{\sigma_z}{z} = |b|\sigma_u$                                                                           a, b are constants
$Z = a\ln(bU)$      $\sigma_z = |a|\frac{\sigma_u}{u}$                                                                           a, b are constants, b > 0
$Z = a\log(bU)$     $\sigma_z = |a|\frac{\sigma_u}{u\ln(10)}$                                                                    a, b are constants, b > 0

Example 4.13 With reference to Example 4.11, we want to give a quantitative answer to the following question: what is the probability that during the second time interval the radioactive source was actually detected? In principle a fluctuation of the number of background counts could give rise to all detected counts.

A solution to this question can be provided by stating the problem in a Bayesian way:

$$ P(\text{detection}) = P(S_2 > 0 \mid \text{data}) $$

where the term “data” refers also to the available measurement of the background, and $S_2 = N_2 - B$ is the number of source counts. This could be elaborated by stating that the data were used to estimate a mean of 15 and a standard deviation of 7.4 for $S_2$, and therefore we want to calculate the probability to exceed zero for such a random variable. We can use the Central Limit Theorem to say that the sum of two random variables, each approximately distributed as a Gaussian since the number of counts is sufficiently large, is Gaussian, and the probability of a positive detection of the radioactive source therefore becomes equivalent to the probability of a Gaussian-distributed variable to have values larger than approximately $\mu - 2\sigma$. According to Table A.3, this probability is approximately 97.7 %. We can therefore conclude that the source was detected in the second time period with such confidence. ◇
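The same probability can be evaluated without a table by using the complementary error function. This sketch assumes, as in the example, that $S_2$ is Gaussian with mean 15 and standard deviation $\sqrt{55}$; the helper name is hypothetical. Because it uses the exact ratio 15/7.4 rather than the rounded value of 2 standard deviations, the result is close to, but not identical with, the tabulated 97.7 %.

```python
import math

def gaussian_prob_above(x, mu, sigma):
    """P(X > x) for a Gaussian variable of mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mu_s2 = 15.0                    # background-subtracted counts, N2 - B
sigma_s2 = math.sqrt(35 + 20)   # uncorrelated Poisson variances add

p_detection = gaussian_prob_above(0.0, mu_s2, sigma_s2)
print(f"P(S2 > 0) = {100 * p_detection:.1f} %")
```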


4.8  The  Quantile  Function  and  Simulation  of  Random 
Variables 

In data analysis one often needs to simulate a random variable, that is, to draw random samples from a parent distribution. The simplest such case is the generation of a random number between two limits, which is equivalent to drawing samples from a uniform distribution. In particular, several Monte Carlo methods, including the Markov chain Monte Carlo method discussed in Chap. 16, will require random variables with different distributions. Most computer languages and programs provide a generator for the uniform distribution, and thus it is useful to learn how to simulate any distribution based on the availability of a simulator for a uniform variable.

Given a variable X with a distribution f(x) and a cumulative distribution function F(x), we start by defining the quantile function $F^{-1}(p)$ as

$$ F^{-1}(p) = \min\{x \in \mathbb{R},\ p \le F(x)\} \tag{4.39} $$

with the meaning that x is the minimum value of the variable at which the cumulative distribution function reaches the value $0 \le p \le 1$. The word “minimum” in the definition of the quantile function is necessary to account for those distributions that have steps (or discontinuities) in their cumulative distribution, but in the more common case of a strictly increasing cumulative distribution, the quantile function is simply defined by the relationship p = F(x). This equation can be solved for x, to obtain the quantile function $x = F^{-1}(p)$.

Example 4.14 (Quantile Function of a Uniform Distribution) For a uniform variable in the range 0-1, the quantile function has a particularly simple form. In fact, F(x) = x, and the quantile function defined by the equation p = F(x) yields x = p, and therefore

$$ x = F^{-1}(p) = p. \tag{4.40} $$

Therefore the analytical form of both the cumulative distribution and the quantile function is identical for the uniform variable in 0-1, meaning that, e.g., the value 0.75 of the random variable is the p = 0.75, or 75 %, quantile of the distribution. ◇

The basic property of the quantile function can be stated mathematically as

$$ p \le F(x) \Leftrightarrow x \ge F^{-1}(p) \tag{4.41} $$

meaning that the value of $F^{-1}(p)$ is the value x at which the probability of having $X \le x$ is p.

Example 4.15 (Quantile Function of an Exponential Distribution) Consider a random variable distributed like an exponential,

$$ f(x) = \lambda e^{-\lambda x}, $$

with $x \ge 0$. Its cumulative distribution function is

$$ F(x) = 1 - e^{-\lambda x}. $$

The quantile function is obtained from

$$ p = F(x) = 1 - e^{-\lambda x}, $$

leading to $x = \ln(1 - p)/(-\lambda)$, and therefore the quantile function is

$$ x = F^{-1}(p) = \frac{\ln(1 - p)}{-\lambda}. $$

Figure 4.2 shows the cumulative distribution and the quantile function for the exponential distribution. ◇

Fig. 4.2 Distribution function f(x), cumulative distribution F(x), and quantile function $F^{-1}(p)$ of an exponential variable with $\lambda = 1/2$ (axes: random variable X, quantile p)


4.8.1  General  Method  to  Simulate  a  Variable 

The method to simulate a random variable is summarized in the following equation,

$$ X = F^{-1}(U), \tag{4.42} $$

which states that any random variable X can be expressed in terms of the uniform variable U between 0 and 1, where F is the cumulative distribution of the variable X, and $F^{-1}$ is the quantile function. If a closed analytic form for F is available for that distribution, this equation results in a simple method to simulate the random variable.

Proof We have already seen that for the uniform variable the quantile function is $F^{-1}(U) = U$, i.e., it is the uniform random variable itself. The proof therefore simply consists of showing that, assuming (4.42), the cumulative distribution of X is indeed F(x), or $P(X \le x) = F(x)$. This can be shown by writing

$$ P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x), $$

in which the second equality follows from the definition of the quantile function, and the last equality follows from the fact that $P(U \le u) = u$, for u a number between 0 and 1, for a uniform variable. □

Example 4.16 (Simulation of an Exponential Variable) Consider a random variable distributed like an exponential, $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$. Given the calculations developed in the example above, the exponential variable can be simulated as

$$ X = \frac{\ln(1 - U)}{-\lambda}. $$

Notice that, although this relationship is between random variables, its practical use is to draw random samples u from U, and a random sample x from X is obtained by simply using the equation

$$ x = \frac{\ln(1 - u)}{-\lambda}. $$

Therefore, for a large sample of values u, the above equation returns a random sample of values for the exponential variable X. ◇
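The recipe of this example can be sketched in a few lines of Python using the standard-library uniform generator. The function name, the value of λ, the sample size, and the seed are arbitrary choices for illustration; the comparison uses the known parent mean 1/λ of the exponential distribution.

```python
import math
import random

def simulate_exponential(lam, n, seed=0):
    """Draw n samples of an exponential variable with rate lam by applying
    the quantile function x = ln(1 - u) / (-lam) to uniform samples u."""
    rng = random.Random(seed)   # seeded only to make the run repeatable
    return [math.log(1.0 - rng.random()) / (-lam) for _ in range(n)]

lam = 0.5
samples = simulate_exponential(lam, 100_000)
sample_mean = sum(samples) / len(samples)

print(f"sample mean = {sample_mean:.3f}, parent mean = {1.0 / lam:.3f}")
```

Since `random.random()` returns values in the half-open range [0, 1), the argument of the logarithm is never zero.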

Example 4.17 (Simulation of the Square of a Uniform Variable) It can be proven that the simulation of the square of a uniform random variable $Y = U^2$ is indeed achieved by squaring samples from a uniform distribution, a very intuitive result.

In fact, we start with the distribution of Y as $g(y) = \frac{1}{2} y^{-1/2}$. Since its cumulative distribution is given by $G(y) = \sqrt{y}$, the quantile function is defined by $p = \sqrt{y}$, or $y = p^2$, and therefore the quantile function for $U^2$ is

$$ y = G^{-1}(p) = p^2. $$

This result, according to (4.42), defines $U^2$, or the square of a uniform distribution, as the function that needs to be simulated to draw fair samples from Y. ◇
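A quick numerical check of this result is to square uniform samples and compare the outcome with two properties of Y derived in the text: the parent mean 1/3 (Example 4.9) and the cumulative distribution $G(y) = \sqrt{y}$. The sample size, seed, and the test point y = 0.25 are arbitrary choices for this sketch.

```python
import random

rng = random.Random(0)   # seeded only to make the run repeatable
n = 100_000

# Y = G^{-1}(U) = U^2: square each uniform sample.
y_samples = [rng.random() ** 2 for _ in range(n)]

sample_mean = sum(y_samples) / n

# Empirical cumulative distribution at y = 0.25, to compare with sqrt(0.25) = 0.5.
fraction_below = sum(1 for y in y_samples if y <= 0.25) / n

print(f"sample mean of Y   = {sample_mean:.3f} (parent mean 1/3)")
print(f"fraction Y <= 0.25 = {fraction_below:.3f} (G(0.25) = 0.5)")
```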


4.8.2  Simulation  of  a  Gaussian  Variable 

This method of simulation of random variables relies on the knowledge of F(x) and the fact that such a function is analytic and invertible. In the case of the Gaussian distribution, the cumulative distribution function is a special function,

$$ F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt, $$

which cannot be inverted analytically. Therefore, this method cannot be applied directly. This complication must be overcome, given the importance of the Gaussian distribution in probability and statistics. Fortunately, a relatively simple method is available that permits the simulation of two Gaussian variables from two uniform random variables.

In Sect. 4.4 we showed that the transformation from Cartesian to polar coordinates results in two random variables R, Θ that are distributed, respectively, like a Rayleigh and a uniform distribution:

$$ h(r) = r e^{-r^2/2}, \quad r \ge 0 $$
$$ i(\theta) = \frac{1}{2\pi}, \quad 0 \le \theta \le 2\pi. \tag{4.43} $$


Since these two distributions have an analytic form for their cumulative distributions, R and Θ can be easily simulated. We can then use the transformation given by (4.7) to simulate a pair of independent standard Gaussians. We start with the Rayleigh distribution, for which the cumulative distribution function is

$$ H(r) = 1 - e^{-r^2/2}. $$

The quantile function is given by

$$ p = 1 - e^{-r^2/2}, $$

and from this we obtain

$$ r = \sqrt{-2\ln(1 - p)} = H^{-1}(p), $$

and therefore $R = \sqrt{-2\ln(1 - U)}$ simulates a Rayleigh distribution, given the uniform variable U. For the uniform variable Θ, it is clear that the cumulative distribution is given by

$$ I(\theta) = \begin{cases} 0 & \theta < 0 \\ \theta/(2\pi) & 0 \le \theta \le 2\pi \\ 1 & \theta > 2\pi; \end{cases} $$

the quantile function is $\theta = 2\pi p = I^{-1}(p)$, and therefore $\Theta = 2\pi V$ simulates a uniform distribution between 0 and 2π, with V the uniform distribution between 0 and 1.

Therefore, with the use of two uniform distributions U, V, we can use R and Θ to simulate a Rayleigh and a uniform angular distribution

$$ R = \sqrt{-2\ln(1 - U)}, \quad \Theta = 2\pi V. \tag{4.44} $$

Then, using the Cartesian-polar coordinate transformation, we arrive at the formulas needed to simulate a pair of Gaussians X and Y:

$$ X = R\cos(\Theta) = \sqrt{-2\ln(1 - U)} \cdot \cos(2\pi V) $$
$$ Y = R\sin(\Theta) = \sqrt{-2\ln(1 - U)} \cdot \sin(2\pi V) \tag{4.45} $$

Equations (4.45) can be easily implemented by having available two simultaneous and independent uniform variables between 0 and 1.
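A minimal sketch of (4.44) and (4.45) in Python is given below; this construction is commonly known as the Box-Muller transform. The function name, sample size, and seed are illustrative assumptions. The script checks that the simulated pairs have sample means close to 0, variances close to 1, and negligible covariance, as expected for independent standard Gaussians.

```python
import math
import random
import statistics

def simulate_gaussian_pair(rng):
    """Return one pair (x, y) of standard Gaussian samples from two
    independent uniform samples u, v, following (4.44) and (4.45)."""
    u = rng.random()
    v = rng.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u))   # Rayleigh-distributed radius
    theta = 2.0 * math.pi * v                 # uniform angle in [0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)   # seeded only to make the run repeatable
pairs = [simulate_gaussian_pair(rng) for _ in range(50_000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
var_x, var_y = statistics.pvariance(xs), statistics.pvariance(ys)
cov_xy = statistics.fmean(x * y for x, y in pairs) - mean_x * mean_y

print(f"means      : {mean_x:+.3f} {mean_y:+.3f}")
print(f"variances  : {var_x:.3f} {var_y:.3f}")
print(f"covariance : {cov_xy:+.3f}")
```

A Gaussian of mean μ and standard deviation σ can then be obtained from a standard Gaussian sample x as μ + σx.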


Summary  of  Key  Concepts  for  this  Chapter 

□ Linear combination of variables: The formulas for the mean and variance of the linear combination of variables are

$$ \mu = \sum_{i=1}^{N} a_i \mu_i $$
$$ \sigma^2 = \sum_{i=1}^{N} a_i^2 \sigma_i^2 + 2 \sum_{i=1}^{N} \sum_{j=i+1}^{N} a_i a_j \sigma_{ij}^2 $$

□ Variance of uncorrelated variables: When variables are uncorrelated the variances add linearly. The variance of the mean of N independent measurements is $\sigma_\mu^2 = \sigma^2/N$.

□ Moment generating function: It is a mathematical function that enables the calculation of moments of a distribution, $M(t) = E[e^{tX}]$.

□ Central Limit theorem: The sum of a large number of independent variables is distributed like a Gaussian of mean equal to the sum of the means and variance equal to the sum of the variances.

□ Method of change of variables: A method to obtain the distribution function of a variable Y that is a function of another variable X, $g(y) = f(x)\,dx/dy$.

□ Law of Large Numbers: The sample mean of a large number of random variables with mean μ tends to a constant number equal to μ.

□ Error propagation formula: It is an approximation for the variance of a function of random variables. For a function x = f(u, v) of two uncorrelated variables U and V, the variance of X is given by

$$ \sigma_x^2 = \left(\frac{\partial f}{\partial u}\right)^2 \sigma_u^2 + \left(\frac{\partial f}{\partial v}\right)^2 \sigma_v^2 $$

□ Quantile function: It is the function $x = F^{-1}(p)$ used to find the value x of a variable that corresponds to a given quantile p.

□ Simulation of a Gaussian: Two Gaussians can be obtained from two uniform random variables U, V via

$$ X = \sqrt{-2\ln(1 - U)} \cos(2\pi V) $$
$$ Y = \sqrt{-2\ln(1 - U)} \sin(2\pi V) $$


Problems

4.1  Consider  the  data  from  Thomson’s  experiment  of  Tube  1 ,  from  page  23. 

(a)  Calculate  the  mean  and  standard  deviation  of  the  measurements  of  v. 

(b)  Use  the  results  from  Problem  2.3,  in  which  the  mean  and  standard  deviation  of 
W/Q  and  I  were  calculated,  to  calculate  the  approximate  values  of  mean  and 
standard  deviation  of  v  using  the  relevant  error  propagation  formula,  assuming 
no  correlation  between  the  two  measurements. 

This  problem  illustrates  that  the  error  propagation  formulas  may  give  different 
results  than  direct  measurement  of  the  mean  and  variance  of  a  variable,  when  the 
individual  measurements  are  available. 

4.2  Calculate  the  mean,  variance,  and  moment  generating  function  M(t)  for  a 
uniform  random  variable  in  the  range  0-1 . 

4.3 Consider two uniform independent random variables X, Y in the range -1 to 1.

(a) Determine the distribution function, mean and variance, and the moment generating function of the variables.

(b) We speculate that the sum of the two random variables is distributed like a “triangular” distribution in the range -2 to 2, with distribution function

$$ f(x) = \begin{cases} \dfrac{1}{2} + \dfrac{x}{4} & \text{if } -2 \le x \le 0 \\ \dfrac{1}{2} - \dfrac{x}{4} & \text{if } 0 \le x \le 2. \end{cases} $$

Using the moment generating function, prove that the variable Z = X + Y is distributed like the triangular distribution above.

4.4 Using a computer language of your choice, simulate the sum of N = 100 uniform variables in the range 0-1, and show that the sampling distribution of the sum of the variables is approximately described by a Gaussian distribution with mean equal to the sum of the means of the N uniform variables and variance equal to the sum of the variances. Use 1,000 and 100,000 samples for each variable.

4.5 Consider the J.J. Thomson experiment of page 23.

(a) Calculate the sample mean and the standard deviation of m/e for Tube 1.

(b) Calculate the approximate mean and standard deviation of m/e from the mean and standard deviation of W/Q and I, according to the equation

$$ \frac{m}{e} = \frac{I^2}{2} \frac{Q}{W}. $$

Assume that W/Q and I are uncorrelated.

4.6 Use the data provided in Example 4.11. Calculate the probability of a positive detection of source counts S in the first time period (where there are $N_1 = 50$ total counts and B = 20 background counts), and the probability that the source emitted >10 source counts. You will need to assume that the measured variable can be approximated by a Gaussian distribution.

4.7 Consider the data in the Thomson experiment for Tube 1 and the fact that the variables W/Q and I are related to the variable v via the relationship

$$ v = \frac{2W}{QI}. $$

Calculate the sample mean and variance of v from the direct measurements of this variable, and then using the measurements of W/Q and I and the error propagation formulas. By comparison of the two estimates of the variance, determine if there is a positive or negative correlation between W/Q and I.

4.8  Provide  a  general  expression  for  the  error  propagation  formula  when  three 
independent  random  variables  are  present,  to  generalize  (4.24)  that  is  valid  for  two 
variables. 


Chapter  5 

Maximum  Likelihood  and  Other  Methods 
to  Estimate  Variables 


Abstract  In  this  chapter  we  study  the  problem  of  estimating  parameters  of  the 
distribution  function  of  a  random  variable  when  N  observations  of  the  variable 
are  available.  We  discuss  methods  that  establish  what  sample  quantities  must  be 
calculated  to  estimate  the  corresponding  parent  quantities.  This  establishes  a  firm 
theoretical  framework  that  justifies  the  definition  of  the  sample  variance  as  an 
unbiased  estimator  of  the  parent  variance,  and  the  sample  mean  as  an  estimator 
of  the  parent  mean.  One  of  these  methods,  the  maximum  likelihood  method,  will 
later  be  used  in  more  complex  applications  that  involve  the  fit  of  two-dimensional 
data  and  the  estimation  of  fit  parameters.  The  concepts  introduced  in  this  chapter 
constitute  the  core  of  the  statistical  techniques  for  the  analysis  of  scientific  data. 


5.1  The  Maximum  Likelihood  Method  for  Gaussian 
Variables 


Consider a random variable X distributed like a Gaussian. The probability of making a measurement between $x_i$ and $x_i + dx$ is given by

$$ f(x_i)\,dx = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}\, dx. $$

This probability describes the likelihood of collecting the data point $x_i$ given that the distribution has fixed values of μ and σ. Assume now that N measurements of the random variable have been made. The goal is to estimate the most likely values of the true, yet unknown, parameters μ and σ that determine the distribution of the random variable. The method of analysis that follows this principle is called the maximum likelihood method. The method is based on the postulate that the values of the unknown parameters are those that yield a maximum probability of observing the measured data. Assuming that the measurements are made independently of one another, the quantity

$$ P = \prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \tag{5.1} $$

is the probability of making N independent measurements in intervals of unit length around the values $x_i$, which can be viewed as the probability of measuring the dataset composed of the given N measurements.

The  method  of  maximum  likelihood  consists  therefore  of  finding  the  parameters 
of  the  distribution  that  maximize  the  probability  in  (5.1).  This  is  simply  achieved 
by  finding  the  point  at  which  the  first  derivative  of  the  probability  P  with  respect  to 
the  relevant  parameter  of  interest  vanishes,  to  find  the  extremum  of  the  function.  It 
can  be  easily  proven  that  the  second  derivative  with  respect  to  the  two  parameters  is 
negative  at  the  point  of  extremum,  and  therefore  this  is  a  point  of  maximum  for  the 
likelihood  function. 
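The maximization can also be illustrated numerically. The sketch below evaluates the logarithm of (5.1) on a coarse grid of trial values of μ and σ and picks the largest value; the simulated dataset, the grid limits, and the grid spacing are hypothetical choices, and a grid search is used only because it is easy to read, not because it is an efficient optimizer. The grid maximum is compared with the sample mean and with the root-mean-square deviation about it.

```python
import math
import random

def log_likelihood(data, mu, sigma):
    """Natural logarithm of the Gaussian likelihood (5.1)."""
    n = len(data)
    chi2 = sum((x - mu) ** 2 for x in data) / sigma**2
    return -0.5 * n * math.log(2.0 * math.pi * sigma**2) - 0.5 * chi2

# Hypothetical dataset: 50 Gaussian measurements with mean 10 and sigma 2.
rng = random.Random(1)
data = [rng.gauss(10.0, 2.0) for _ in range(50)]

# Trial parameter values on a grid with spacing 0.02.
mu_grid = [8.0 + 0.02 * k for k in range(201)]      # 8.00 ... 12.00
sigma_grid = [1.0 + 0.02 * k for k in range(101)]   # 1.00 ... 3.00

best_ll, best_mu, best_sigma = max(
    (log_likelihood(data, m, s), m, s) for m in mu_grid for s in sigma_grid
)

sample_mean = sum(data) / len(data)
rms_dev = math.sqrt(sum((x - sample_mean) ** 2 for x in data) / len(data))

print(f"grid maximum : mu = {best_mu:.2f}, sigma = {best_sigma:.2f}")
print(f"from data    : mean = {sample_mean:.2f}, rms deviation = {rms_dev:.2f}")
```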


5.1.1  Estimate  of  the  Mean 

To find the maximum-likelihood estimate of the mean of the Gaussian distribution we proceed with the calculation of the first derivative of ln P, instead of P, with respect to the mean μ. Given that the logarithm is a monotonic function of the argument, maximization of ln P is equivalent to that of P, and the logarithm has the advantage of ease of computation. We obtain

$$ \frac{\partial}{\partial\mu} \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2} = 0. $$

The solution is the maximum-likelihood estimator of the mean, which we define as $\mu_{ML}$, and is given by

$$ \mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i = \bar{x}. \tag{5.2} $$

This result was to be expected: the maximum likelihood method shows that the “best” estimate of the mean is simply the sample average of the measurements.

The quantity $\mu_{ML}$ is a quantity that, despite the Greek letter normally reserved for parent quantities, is a function of the measurements. Although it appears obvious that the sample average is the correct estimator of the true mean, it is necessary to prove this statement by calculating its expectation. It is clear that the expectation of the sample average is in fact

$$ E[\bar{x}] = \frac{1}{N} E[x_1 + \ldots + x_N] = \mu. $$

This calculation is used to conclude that the sample mean is an unbiased estimator of the true mean.


5.1.2 Estimate of the Variance


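Following the same method used to estimate the mean, we can also take the first
derivative of $\ln P$ with respect to $\sigma^2$, to obtain

$$\frac{\partial}{\partial\sigma^2}\left( N \ln \frac{1}{\sqrt{2\pi\sigma^2}} \right) + \frac{\partial}{\partial\sigma^2}\left( -\sum_{i=1}^{N} \frac{(x_i-\mu)^2}{2\sigma^2} \right) = 0$$

from which we obtain

$$N\left(-\frac{1}{2\sigma^2}\right) + \frac{1}{2\sigma^4} \sum_{i=1}^{N} (x_i-\mu)^2 = 0$$

and finally the result that the maximum likelihood estimator of the variance is

$$\sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2. \tag{5.3}$$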


It is necessary to notice that in the maximum likelihood estimate of the variance
we have implicitly assumed that the mean $\mu$ was known, while in reality we can
only estimate it as the sample mean, from the same data used also to estimate the
variance. To account for the fact that $\mu$ is not known, we replace it with $\bar{x}$ in (5.3),
and call

$$s^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} (x_i-\bar{x})^2 \tag{5.4}$$

the maximum likelihood sample variance estimator, which differs from the sample
variance defined in (2.11) by a factor of $(N-1)/N$. The fact that $\bar{x}$ replaced $\mu$ in its
definition leads to the following expectation:

$$E[s^2_{ML}] = \frac{N-1}{N}\,\sigma^2. \tag{5.5}$$

Proof Calculation of the expectation is obtained as


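$$E[s^2_{ML}] = E\left[\frac{1}{N}\sum (x_i-\bar{x})^2\right] = \frac{1}{N} E\left[\sum (x_i-\mu+\mu-\bar{x})^2\right]$$

$$= \frac{1}{N} E\left[\sum (x_i-\mu)^2 + \sum (\mu-\bar{x})^2 + 2(\mu-\bar{x}) \sum (x_i-\mu)\right].$$

The term $E[(\mu-\bar{x})^2]$ is the variance of the sample mean, which we
know from Sect. 4.1 to be equal to $\sigma^2/N$. The sum in the last term of the equation is
$\sum (x_i-\mu) = N(\bar{x}-\mu)$, therefore:

$$E[s^2_{ML}] = \frac{1}{N}\left( \sum E[(x_i-\mu)^2] + N E[(\mu-\bar{x})^2] + 2N E[(\mu-\bar{x})(\bar{x}-\mu)] \right)$$

$$= \frac{1}{N}\left( N\sigma^2 + N\sigma^2/N - 2N E[(\mu-\bar{x})^2] \right)$$

leading to the result that

$$E[s^2_{ML}] = \frac{1}{N}\left( N\sigma^2 + N\sigma^2/N - 2N\sigma^2/N \right) = \frac{N-1}{N}\,\sigma^2.$$

In this proof we used the notation $\sigma^2 = \sigma^2_{ML}$. □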

This result is at first somewhat surprising, since there is an extra factor $(N-1)/N$
that makes $E[s^2_{ML}]$ different from the maximum likelihood estimator of $\sigma^2$. This
is actually due to the fact that, in estimating the variance, the mean needed to be
estimated as well and was not known beforehand. The unbiased estimator of the
variance is therefore

$$s^2 = s^2_{ML} \times \frac{N}{N-1} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i-\bar{x})^2 \tag{5.6}$$

for which we have shown that $E[s^2] = \sigma^2$. This is the reason for the definition of
the sample variance according to (5.6), and not (5.4).
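The bias factor $(N-1)/N$ can be checked exactly on a toy case. The sketch below is an illustration only: it assumes a hypothetical two-valued parent (0 or 1 with equal probability, so that the parent variance is 1/4) rather than the Gaussian of this section, which is permissible because the proof above uses only expectations. The helper names are not from the text.

```python
from fractions import Fraction
from itertools import product

def s2_ml(sample):
    """Maximum likelihood sample variance (5.4): squared deviations divided by N."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / n

def expected_s2_ml(values, n):
    """Exact E[s2_ML] for n independent draws, each value in `values` equally likely."""
    samples = list(product(values, repeat=n))
    return sum(s2_ml(s) for s in samples) / len(samples)

values = (0, 1)               # hypothetical parent: mean 1/2, variance 1/4
parent_var = Fraction(1, 4)

for n in (2, 3, 5):
    e_ml = expected_s2_ml(values, n)
    # Rescaling by N/(N-1), as in (5.6), recovers the parent variance.
    print(n, e_ml, e_ml * n / (n - 1), parent_var)
```

Exact fractions are used instead of random sampling so that the comparison with $(N-1)\sigma^2/N$ is not blurred by statistical noise.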

It is important to note that (5.6) defines a statistic for which we
could find, in addition to its expectation, also its variance, similar to what was
done for the sample mean. In Chap. 7 we will study how to determine the probability
distribution function of certain statistics of common use, including the distribution
of the sample variance $s^2$.

Example 5.1 We have already made use of the sample mean and the sample
variance as estimators for the parent quantities in the analysis of the data from
Thomson's experiment (page 23). The estimates we obtained are unbiased if the
assumptions of the maximum likelihood method are satisfied, namely that $I$ and
$W/Q$ are Gaussian distributed. ◊


5.1.3  Estimate  of  Mean  for  Non-uniform  Uncertainties 

In the previous sections we assumed a set of measurements $x_i$ of the same random
variable, i.e., the parent mean $\mu$ and variance $\sigma^2$ were the same. It is often the case
with real datasets that observations are made from variables with the same mean,
but with different variance. This could be the case when certain measurements are
more precise than others, and therefore they feature the same mean (since they are
drawn from the same process), but the standard error varies with the precision of the
instrument, or because some measurements were performed for a longer period of
time. In this case, each measurement $x_i$ is assigned a different standard deviation $\sigma_i$,
which represents the precision with which that measurement was made.

Example 5.2 A detector is used to measure the rate of arrival of a certain species of
particles. One measurement consists of 100 counts in 10 s, another of 180 particles
in 20 s, and one of 33 particles in 3 s. The measured count rates would be reported
as, respectively, 10.0, 9.0, and 11.0 counts per second. Given that this is a counting
experiment, the Poisson distribution applies to each of the measurements. Moreover,
since the number of counts is sufficiently large, it is reasonable to approximate the
Poisson distribution with a Gaussian, with variance equal to the mean. Therefore
the variance of the counts is 100, 180, and 33, and the variance of the count rate
can be calculated by the property that $\mathrm{Var}[X/t] = \mathrm{Var}[X]/t^2$, where $t$ is the known
time of each measurement. It follows that the standard deviation $\sigma$ of the count
rates is, respectively, 1.0, 0.67, and 1.91 for the three measurements. The three
measurements would be reported as 10.0 ± 1.0, 9.0 ± 0.67, and 11.0 ± 1.91, with
the last measurement being clearly of lower precision because of the shorter period
of observation. ◊
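The arithmetic of Example 5.2 can be reproduced with a few lines of Python. This is a minimal sketch under the same assumption made in the example (variance of the counts equal to the number of counts); the function name `count_rate` is introduced here for illustration.

```python
import math

def count_rate(counts, seconds):
    """Return (rate, sigma) for Poisson counts collected over a known time."""
    rate = counts / seconds
    # Var[X/t] = Var[X]/t^2, with Var[X] approximated by the observed counts.
    sigma = math.sqrt(counts) / seconds
    return rate, sigma

observations = [(100, 10), (180, 20), (33, 3)]   # (counts, seconds) from Example 5.2
for counts, seconds in observations:
    rate, sigma = count_rate(counts, seconds)
    print(f"{rate:.1f} +/- {sigma:.2f} counts per second")
```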

Our goal is now the maximum likelihood estimate of the
parent mean $\mu$, which is the same for all measurements. This is achieved by using
(5.1) in which the parent mean of each measurement is $\mu$, and the parent variance
of each measurement is $\sigma_i^2$. Following the same procedure as in the case of equal
standard deviations for each measurement, we start with the probability $P$ of making
the N measurements,

$$P = \prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma_i^2}}. \tag{5.7}$$


Setting the derivative of $\ln P$ with respect to $\mu$ (the mean common to all
measurements) equal to zero, we obtain that the maximum likelihood estimate is

$$\mu_{ML} = \frac{\sum_{i=1}^{N} (x_i/\sigma_i^2)}{\sum_{i=1}^{N} (1/\sigma_i^2)}. \tag{5.8}$$


This is the weighted mean of the measurements, where the weights are the inverse
of the variance of each measurement, $1/\sigma_i^2$.

The variance in this weighted sample mean can be calculated using the expectation
of the weighted mean, by assuming that the $\sigma_i^2$ are constant numbers:

$$\mathrm{Var}(\mu_{ML}) = \frac{\sum_{i=1}^{N} (1/\sigma_i^2)}{\left(\sum_{i=1}^{N} (1/\sigma_i^2)\right)^2},$$

which results in

$$\mathrm{Var}(\mu_{ML}) = \frac{1}{\sum_{i=1}^{N} (1/\sigma_i^2)}. \tag{5.9}$$

The variance of the weighted mean in (5.9) becomes the usual $\sigma^2/N$ if all variances
$\sigma_i^2$ are identical.

Example 5.3 Continuing Example 5.2 of the count rate of particle arrivals, we use
(5.8) and (5.9) to calculate a weighted mean and standard deviation of 9.44 and
0.53. Since the interest is just in the overall mean of the rate, a more direct way
to obtain this number is to count a total of 313 counts in 33 s, for an overall
measurement of the count rate of 9.48 ± 0.54, which is virtually identical to that
obtained using the weighted mean and its variance. ◊
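The weighted mean (5.8) and its variance (5.9) are simple to evaluate numerically. The sketch below recomputes the numbers of Example 5.3 from the raw counts and times of Example 5.2; the function name `weighted_mean` is an illustrative choice, not notation from the text.

```python
import math

def weighted_mean(values, sigmas):
    """Weighted mean (5.8) and its standard deviation, the square root of (5.9)."""
    weights = [1.0 / s ** 2 for s in sigmas]          # weights 1/sigma_i^2
    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total_weight
    return mean, math.sqrt(1.0 / total_weight)

counts = [100, 180, 33]
times = [10, 20, 3]
rates = [c / t for c, t in zip(counts, times)]
sigmas = [math.sqrt(c) / t for c, t in zip(counts, times)]

mean, err = weighted_mean(rates, sigmas)
print(f"weighted mean: {mean:.2f} +/- {err:.2f}")

# Direct estimate from the total counts and total time, for comparison.
direct = sum(counts) / sum(times)
direct_err = math.sqrt(sum(counts)) / sum(times)
print(f"direct:        {direct:.2f} +/- {direct_err:.2f}")
```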

It is common, as in the example above, to assume that the parent variance $\sigma_i^2$ is
equal to the value estimated from the measurements themselves. This approximation
is necessary, unless the actual precision of the measurement is known beforehand
by some other means, for example because the apparatus used for the experiment
has been calibrated by prior measurements.


5.2  The  Maximum  Likelihood  Method  for  Other 
Distributions 


The  method  of  maximum  likelihood  can  also  be  applied  when  the  measurements  do 
not  follow  a  Gaussian  distribution. 

A typical case is that of N measurements $n_i$, $i = 1, \ldots, N$, from a Poisson
variable of parameter $\lambda$, applicable to all situations in which the measurements are
derived from a counting experiment. In this case, the maximum likelihood method
can be used to estimate $\lambda$, which is the mean of the random variable, and the only
parameter of the Poisson distribution.

The Poisson distribution is discrete in nature and the probability of making N
independent measurements is simply given by

$$P = \prod_{i=1}^{N} \frac{\lambda^{n_i}}{n_i!}\, e^{-\lambda}.$$

It is convenient to work with logarithms,

$$\ln P = \sum_{i=1}^{N} \ln e^{-\lambda} + \sum_{i=1}^{N} \ln \lambda^{n_i} - \sum_{i=1}^{N} \ln n_i! = A - N\lambda + \ln\lambda \sum_{i=1}^{N} n_i,$$

in which $A$ is a term that doesn't depend on $\lambda$. The condition that the probability
must be maximum requires $\partial P/\partial\lambda = 0$ or, equivalently, $\partial \ln P/\partial\lambda = 0$. This
condition results in

$$\frac{1}{\lambda}\sum_{i=1}^{N} n_i - N = 0,$$

and therefore we obtain that the maximum likelihood estimator of the $\lambda$ parameter
of the Poisson distribution is

$$\lambda_{ML} = \frac{1}{N}\sum_{i=1}^{N} n_i.$$

This  result  was  to  be  expected,  since  A  is  the  mean  of  the  Poisson  distribution,  and 
the  linear  average  of  N  measurements  is  an  unbiased  estimate  of  the  mean  of  a 
random  variable,  according  to  the  Law  of  Large  Numbers. 
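A brute-force check can make the Poisson result concrete. The sketch below evaluates the logarithm of the Poisson likelihood on a grid of trial means and compares the location of the maximum with the sample average. The counts are hypothetical values invented for illustration, and the grid search is only a demonstration device, not a method used in the text.

```python
import math

def poisson_log_likelihood(lam, counts):
    """ln P for independent Poisson counts with common mean lam."""
    return sum(-lam + n * math.log(lam) - math.lgamma(n + 1) for n in counts)

counts = [3, 7, 4, 6, 5]                      # hypothetical measurements
lam_ml = sum(counts) / len(counts)            # analytic estimator: the sample mean

grid = [0.01 * k for k in range(1, 2001)]     # trial means from 0.01 to 20.00
lam_grid = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

print(f"sample mean: {lam_ml:.2f}, grid maximum: {lam_grid:.2f}")
```

The term `math.lgamma(n + 1)` is $\ln n!$; it does not depend on the trial mean and therefore does not affect the position of the maximum.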

The  maximum  likelihood  method  can  in  general  be  used  for  any  type  of 
distribution,  although  often  the  calculations  can  be  mathematically  challenging  if 
the  distribution  is  not  a  Gaussian. 


5.3  Method  of  Moments 

The method of moments takes a more practical approach to the estimate of the
parameters of a distribution function. Consider a random variable $X$ for which we
have N measurements and whose probability distribution function $f(x)$ depends on
M unknown parameters, for example $\theta_1 = \mu$ and $\theta_2 = \sigma^2$ for a Gaussian (M = 2),
or $\theta_1 = \lambda$ for an exponential (M = 1), etc. The idea is to develop a method that
yields as many equations as there are free parameters and solve for the parameters
of the distribution. The method starts with the determination of arbitrary functions
$a_j(x)$, $j = 1, \ldots, M$, that make the distribution function integrable:

$$\int_{-\infty}^{\infty} a_j(x) f(x)\,dx = g_j(\theta) \tag{5.10}$$

where $g_j(\theta)$ is an analytic function of the parameters of the distribution. Although
we have assumed that the random variable is continuous, the method can also be
applied to discrete distributions. According to the law of large numbers, the left-hand
side of (5.10) can be approximated by the sample mean of the function of the
N measurements, and therefore we obtain a system of M equations:

$$\frac{1}{N}\left(a_j(x_1) + \cdots + a_j(x_N)\right) = g_j(\theta) \tag{5.11}$$

which can be solved for the parameters $\theta$ as a function of the N measurements $x_i$.

As an illustration of the method, consider the case in which the parent distribution
is a Gaussian of parameters $\mu$, $\sigma^2$. First, we need to decide which functions $a_1(x)$
and $a_2(x)$ to choose. A simple and logical choice is to use $a_1(x) = x$ and $a_2(x) = x^2$;
this choice is what gives the method the name of "moments," since the right-hand side of (5.10)
will be, respectively, the first and second order moment. Therefore we obtain the
two equations

$$\begin{cases} E[a_1(X)] = \dfrac{1}{N}(x_1 + \ldots + x_N) = \mu \\[2ex] E[a_2(X)] = \dfrac{1}{N}(x_1^2 + \ldots + x_N^2) = \sigma^2 + \mu^2. \end{cases} \tag{5.12}$$

The estimators for the mean and variance are therefore

$$\begin{cases} \mu_{MM} = \dfrac{1}{N}(x_1 + \ldots + x_N) \\[2ex] \sigma^2_{MM} = \dfrac{1}{N}(x_1^2 + \ldots + x_N^2) - \left(\dfrac{1}{N}(x_1 + \ldots + x_N)\right)^2 = \dfrac{1}{N}\sum (x_i - \mu_{MM})^2 \end{cases} \tag{5.13}$$

which, in this case, are identical to the estimates obtained from the likelihood
method. This method is often easier computationally than the method of maximum
likelihood, since it does not require the maximization of a function, but just a careful
choice of the integrating functions $a_j(x)$. Also, notice that in this application we
did not make explicit use of the assumption that the distribution is a Gaussian,
since the same results will apply to any distribution function with mean $\mu$ and
variance $\sigma^2$. Equation (5.13) can therefore be used in a variety of situations in which
the distribution function has parameters that are related to the mean and variance,
even if they are not identical to them, as in the case of the Gaussian. The method
of moments therefore returns unbiased estimates for the mean and variance of every
distribution in the case of a large number of measurements.

Example 5.4 Consider the five measurements presented in Example 4.9: 0.1, 0.3,
0.5, 0.7, and 0.9, and assume that they are known to be drawn from a uniform
distribution between 0 and $a$. The method of moments can be used to estimate the
parameter $a$ of the distribution from the measurements. The probability distribution
function is $f(x) = 1/a$ between 0 and $a$, and null otherwise. Using the integrating
function $a_1(x) = x$, the method of moments proceeds with the calculation of the
first moment of the distribution,

$$\int_0^a x f(x)\,dx = \frac{a}{2}.$$

Therefore, using (5.12), we can estimate the only parameter $a$ of the distribution
function as

$$a = 2 \times \frac{1}{N}\sum_{i=1}^{N} x_i$$

where N = 5, for the result of $a = 1$. The result confirms that the five measurements
are compatible with a parent mean of 1/2. ◊
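The method of moments translates directly into code: compute sample moments, then solve for the parameters. The following sketch applies (5.13) and the uniform-distribution estimate of Example 5.4 to the same five measurements; the function names are illustrative and do not appear in the text.

```python
def sample_moment(data, order):
    """Sample mean of x**order, i.e. the left-hand side of (5.11) for a_j(x) = x**j."""
    return sum(x ** order for x in data) / len(data)

def gaussian_mm(data):
    """Method-of-moments estimates (5.13) of the mean and variance."""
    m1 = sample_moment(data, 1)
    m2 = sample_moment(data, 2)
    return m1, m2 - m1 ** 2

def uniform_mm(data):
    """Estimate of a for a uniform distribution on [0, a], from a/2 = first moment."""
    return 2.0 * sample_moment(data, 1)

data = [0.1, 0.3, 0.5, 0.7, 0.9]              # measurements of Example 5.4
mu_mm, var_mm = gaussian_mm(data)
print(f"mu_MM = {mu_mm:.2f}, sigma2_MM = {var_mm:.2f}, a = {uniform_mm(data):.2f}")
```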


5.4  Quantiles  and  Confidence  Intervals 

The parameters of the distribution function can be used to determine the range
of values that include a given probability, for example, 68.3 %, or 90 %, or 99 %,
etc. This range, called confidence interval, can be conveniently described by the
cumulative distribution function $F(x)$.

Define the $\alpha$-quantile $x_\alpha$, where $\alpha$ is a number between 0 and 1, as the value of
the variable such that $x \le x_\alpha$ with probability $\alpha$:

$\alpha$ quantile $x_\alpha$ : $P(x \le x_\alpha) = \alpha$ or $F(x_\alpha) = \alpha$. (5.14)

For example, consider the cumulative distribution shown in Fig. 5.1 (right panel):
the $\alpha = 0.05$ quantile is the number of the variable $x$ where the lower horizontal
dashed line intersects the cumulative distribution $F(x)$, $x_\alpha \simeq 0.2$, and the $\beta = 0.95$
quantile is the number of the variable $x$ where the upper dashed line intersects $F(x)$,
$x_\beta \simeq 6$. Therefore the range $x_\alpha$ to $x_\beta$, or 0.2-6, corresponds to the $(\beta - \alpha) = 90$ %
confidence interval, i.e., there is 90 % probability that a measurement of the variable
falls in that range. These confidence intervals are called central because they are
centered at the mean (or median) of the distribution, and are the most commonly
used type of confidence intervals.
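For distributions whose cumulative function can be inverted analytically, quantiles follow directly from $F(x_\alpha) = \alpha$. As a sketch, the exponential distribution has $F(x) = 1 - e^{-\lambda x}$, so that $x_\alpha = -\ln(1-\alpha)/\lambda$. The rate used below is a hypothetical value chosen for illustration, and is not claimed to be the one used to draw Fig. 5.1.

```python
import math

def exponential_cdf(x, rate):
    """Cumulative distribution F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x)

def exponential_quantile(alpha, rate):
    """Value x_alpha such that F(x_alpha) = alpha."""
    return -math.log(1.0 - alpha) / rate

rate = 0.5                                    # hypothetical rate parameter
x_lo = exponential_quantile(0.05, rate)       # alpha = 0.05 quantile
x_hi = exponential_quantile(0.95, rate)       # beta = 0.95 quantile

enclosed = exponential_cdf(x_hi, rate) - exponential_cdf(x_lo, rate)
print(f"interval {x_lo:.2f} to {x_hi:.2f} encloses {100 * enclosed:.0f} %")
```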

Confidence intervals can be constructed at any confidence level desired, depending
on applications and on the value of probability that the analyst wishes to
include in that interval. It is common to use 68 % confidence intervals because
this is the probability between $\pm\sigma$ of the mean for a Gaussian variable (see
Sect. 5.4.1). Normally a confidence interval or limit at a significance lower than
68 % is not considered interesting, since there is a significant probability that the
random variable will be outside of this range.

Fig. 5.1 (Left) Distribution function of an exponential variable with central 68 and 90 %
confidence intervals marked by, respectively, dot-dashed and dotted lines. (Right) The confidence
intervals are obtained as the intersection of the dot-dashed and dotted lines with the cumulative
distribution (solid line). Both panels have the random variable X on the horizontal axis.

A one-sided confidence interval that extends down to $-\infty$, or to the lowest value
allowed for that random variable, is called an upper limit, and an interval that extends
to $+\infty$, or to the highest allowed value, is called a lower limit. A lower limit
describes a situation in which a large number is detected, for example counts from
a Poisson experiment, and we want to describe how small the value of the variable
can be, and still be consistent with the data. An upper limit is used for a situation
in which a small number is detected, to describe how high the variable can be and
still be consistent with the data. Lower and upper limits depend on the value of the
probability that we want to use; for example, using a value for $\alpha$ that is closer to 0
results in a lower limit that progressively becomes $\lambda_{lo} = -\infty$ (or the lowest allowed
value), which is not a very interesting statement. If $\beta$ is progressively closer to 1,
the upper limit will tend to $\lambda_{up} = \infty$.


5.4.1  Confidence  Intervals  for  a  Gaussian  Variable 

When the variable is described by a Gaussian distribution function we can use
integral tables (Table A.2) to determine confidence intervals that enclose a given
probability. It is usually meaningful to have central confidence intervals, i.e.,
intervals centered at the mean of the distribution and extending by equal amounts on
either side of the mean. For central confidence intervals, the relationship between
the probability $p$ enclosed by a given interval (say $p = 0.9$ or 90 % confidence
interval) and the size $\Delta x = 2(z \times \sigma)$ of the interval is given by

$$p = \int_{\mu - z\sigma}^{\mu + z\sigma} f(x)\,dx \tag{5.15}$$

where $f(x)$ is a Gaussian of mean $\mu$ and variance $\sigma^2$. The number $z$ represents the
number of standard deviations allowed by the interval in each direction (positive
and negative relative to the mean). The most common central confidence intervals
for a Gaussian distribution are reported in Table 5.1. For example, for a mean $\mu$
and variance $\sigma^2$, the interval from $\mu - \sigma$ to $\mu + \sigma$ is a 68.3 % confidence
interval, and the interval from $\mu - 1.65\sigma$ to $\mu + 1.65\sigma$ is a 90 % confidence interval.
In principle, one could have confidence intervals that are not centered on the mean
of the distribution; such intervals would still be valid confidence intervals. It can
be shown that central confidence intervals are the smallest possible, for a given
confidence level.
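The probability enclosed by a central Gaussian interval of half-width $z\sigma$ can also be computed without integral tables, using the standard identity $p = \mathrm{erf}(z/\sqrt{2})$ for the integral of a Gaussian between $\mu - z\sigma$ and $\mu + z\sigma$. The sketch below uses only the Python standard library; the function name is an illustrative choice.

```python
import math

def enclosed_probability(z):
    """Probability of a Gaussian variable falling within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2.0))

# Multiples of sigma quoted in Table 5.1.
for z in (0.68, 1.0, 1.65, 2.0, 3.0):
    print(f"z = {z:4.2f}: p = {100 * enclosed_probability(z):.1f} %")
```

The printed values agree with the tabulated probabilities to within the rounding of the $z$ values quoted in the table.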

Example 5.5 Using the data for the J.J. Thomson experiment on the measurement
of the electron's mass-to-charge ratio, we can calculate the 90 % confidence interval
on $m/e$ for Tube 1 and Tube 2. For Tube 1, we estimated the mean as $\mu_1 = 0.42$
and the standard error as $\sigma_1 = 0.07$, and for Tube 2 $\mu_2 = 0.53$ and $\sigma_2 = 0.08$.
Since the random variable is assumed to be Gaussian, the 90 % confidence interval
corresponds to the range between $\mu - 1.65\sigma$ and $\mu + 1.65\sigma$; therefore for the
Thomson measurements of Tube 1 and Tube 2, the 90 % central confidence intervals
are, respectively, 0.30-0.54 and 0.40-0.66. ◊

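Upper and lower limits can be easily calculated using the estimates of $\mu$ and
$\sigma$ for a Gaussian variable. They are obtained numerically from the following
relationships,

$$\begin{cases} p = \displaystyle\int_{-\infty}^{\lambda_{up}} f(x)\,dx = F(\lambda_{up}) & \text{upper limit } \lambda_{up} \\[2ex] p = \displaystyle\int_{\lambda_{lo}}^{\infty} f(x)\,dx = 1 - F(\lambda_{lo}) & \text{lower limit } \lambda_{lo} \end{cases} \tag{5.16}$$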

making use of Tables A.2 and A.3. The quantities $F(\lambda_{up})$ and $F(\lambda_{lo})$ are the values
of the cumulative distribution of the Gaussian, showing that $\lambda_{up}$ is the $p$-quantile
and $\lambda_{lo}$ is the $(1-p)$-quantile of the distribution. Useful upper and lower limits
for the Gaussian distribution are reported in Table 5.2. Upper limits are typically of
interest when the measurements result in a low value of the mean of the variable.
In this case we usually want to know how high the variable can be and still be
consistent with the measurement, at a given confidence level. For example, in the
case of the measurement of Tube 1 for the Thomson experiment, the variable $m/e$
was measured to be 0.42 ± 0.07. In this case, it is interesting to ask the question
of how high $m/e$ can be and still be consistent with the measurement at a given
confidence level.


Table 5.1 Common confidence intervals for a Gaussian distribution

Interval                    Range                                       Enclosed probability (%)
50 % confidence interval    $\mu - 0.68\sigma$, $\mu + 0.68\sigma$      50
1-$\sigma$ interval         $\mu - \sigma$, $\mu + \sigma$              68.3
90 % confidence interval    $\mu - 1.65\sigma$, $\mu + 1.65\sigma$      90
2-$\sigma$ interval         $\mu - 2\sigma$, $\mu + 2\sigma$            95.5
3-$\sigma$ interval         $\mu - 3\sigma$, $\mu + 3\sigma$            99.7


Table 5.2 Common upper and lower limits for a Gaussian distribution

Upper limit        Range                      Enclosed probability (%)    Lower limit        Range                      Enclosed probability (%)
50 % confidence    $\le \mu$                  50                          50 % confidence    $\ge \mu$                  50
90 % confidence    $\le \mu + 1.28\sigma$     90                          90 % confidence    $\ge \mu - 1.28\sigma$     90
95 % confidence    $\le \mu + 1.65\sigma$     95                          95 % confidence    $\ge \mu - 1.65\sigma$     95
99 % confidence    $\le \mu + 2.33\sigma$     99                          99 % confidence    $\ge \mu - 2.33\sigma$     99
1-$\sigma$         $\le \mu + \sigma$         84.1                        1-$\sigma$         $\ge \mu - \sigma$         84.1
2-$\sigma$         $\le \mu + 2\sigma$        97.7                        2-$\sigma$         $\ge \mu - 2\sigma$        97.7
3-$\sigma$         $\le \mu + 3\sigma$        99.9                        3-$\sigma$         $\ge \mu - 3\sigma$        99.9

Example 5.6 Using the data for the J.J. Thomson experiment on the measurement
of the electron's mass-to-charge ratio, we can calculate the 90 % upper limits to $m/e$
for Tube 1 and Tube 2. For Tube 1, we estimated the mean as $\mu_1 = 0.42$ and the
standard error as $\sigma_1 = 0.07$, and for Tube 2 $\mu_2 = 0.53$ and $\sigma_2 = 0.08$.

To determine the upper limit $m/e_{UL,90}$ of the ratio, we calculate the probability
of occurrence of $m/e \le m/e_{UL,90}$:

$$P(m/e \le m/e_{UL,90}) = 0.90.$$

Since the random variable is assumed to be Gaussian, the value $x \simeq \mu + 1.28\sigma$
corresponds to the 90th percentile of the distribution (see Table 5.2). The two 90 %
upper limits are, respectively, 0.51 and 0.63. ◊
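One-sided Gaussian limits are quantiles of the distribution, so they can be obtained from the inverse cumulative distribution instead of a table lookup. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8) to reproduce the upper limits of Example 5.6; the helper names are illustrative.

```python
from statistics import NormalDist

def gaussian_upper_limit(mean, sigma, confidence):
    """Value that a Gaussian variable stays below with probability `confidence`."""
    return NormalDist(mu=mean, sigma=sigma).inv_cdf(confidence)

def gaussian_lower_limit(mean, sigma, confidence):
    """Value that a Gaussian variable stays above with probability `confidence`."""
    return NormalDist(mu=mean, sigma=sigma).inv_cdf(1.0 - confidence)

tube1 = gaussian_upper_limit(0.42, 0.07, 0.90)
tube2 = gaussian_upper_limit(0.53, 0.08, 0.90)
print(f"90 % upper limits: {tube1:.2f} and {tube2:.2f}")
```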

A common application of upper limits is when an experiment has failed to
detect the variable of interest. In this case we have a non-detection and we want
to place upper limits based on the measurements made. This problem is addressed
by considering the parent distribution of the variable that we did not detect, for
example a Gaussian of zero mean and given variance. We determine the upper limit
as the value of the variable that exceeds the mean by 1, 2 or 3 $\sigma$, corresponding
to the probability levels shown in Table 5.2. A 3-$\sigma$ upper limit, for example, is the
value of the variable that has only a 0.1 % chance of being observed based on the
parent distribution for the non-detection, and therefore we are 99.9 % confident that
the true value of the variable is lower than this upper limit.

Example 5.7 (Gaussian Upper Limit to the Non-detection of a Source) A measurement
of $n = 8$ counts in a given time interval is made in the presence of a source
of unknown intensity. The instrument used for the measurement has a background
level with a mean of 9.8 ± 0.4 counts, as estimated from an independent experiment
of long duration. Given that the measurement is below the expected background
level, it is evident that there is no positive detection of the source. The hypothesis
that the source has zero emission can be described by a distribution function with
a mean of approximately 9.8 counts and, since this is a counting experiment, the
probability distribution of counts should be Poisson. We are willing to approximate
the distribution with a Gaussian of same mean, and variance equal to the mean, or
$\sigma \simeq 3.1$, to describe the distribution of counts one expects from an experiment of
the same duration as the one that yielded $n = 8$ counts.

A 99 % upper limit to the number of counts that can be recorded by this
instrument, in the given time interval, can be calculated according to Table 5.2 as

$$\mu + 2.33\sigma = 9.8 + 2.33 \times 3.1 \simeq 17.$$

This means that we are 99 % confident that the true value of the source plus
background counts is less than 17. A complementary way to interpret this number
is that the experimenter can be 99 % sure that the measurement cannot be due to
just the background if there is a detection of $> 17$ total counts. A conservative
analyst might also want to include the possibility that the Gaussian distribution
has a slightly higher mean, since the level of the background is not known exactly,
and conservatively assume that perhaps 18 counts are required to establish that the
source does have a positive level of emission. After subtraction of the assumed
background level, we can conclude that the 99 % upper limit to the source's true
emission level in the time interval is 8.2 counts. This example was adapted from the
analysis of an astronomical source that resulted in a non-detection [4]. ◊


5.4.2  Confidence  Intervals  for  the  Mean  of  a  Poisson  Variable 

The  Poisson  distribution  does  not  have  the  simple  analytical  properties  of  the 
Gaussian  distribution.  For  this  distribution  it  is  convenient  to  follow  a  different 
method  to  determine  its  confidence  intervals. 

Consider the case of a single measurement of a Poisson variable of unknown
mean $\lambda$ for which $n_{obs}$ counts were recorded. We want to make inferences on the parent
mean based on this information. Also, we assume that the measurement includes
a uniform and known background $\lambda_B$. The measurement is therefore drawn from a
random variable

$$X = N_S + N_B \tag{5.17}$$

in which $N_B = \lambda_B$ is assumed to be a constant, i.e., the background is known exactly
(this generalization can be bypassed by simply setting $\lambda_B = 0$). The probability
distribution function of $X$ (the total source plus background counts) is

$$f(n) = \frac{(\lambda + \lambda_B)^n}{n!}\, e^{-(\lambda + \lambda_B)}, \tag{5.18}$$

where $n$ is an integer number describing possible values of $X$. Equation (5.18) is true
even if the background is not known exactly, since the sum of two Poisson variables
is also a Poisson variable with mean equal to the sum of the means. It is evident that,
given the only measurement available, the estimate of the source mean is

$$\hat{\lambda} = n_{obs} - \lambda_B.$$


This estimate is the starting point to determine a confidence interval for the parent
mean. We define the lower limit $\lambda_{lo}$ as the value of the source mean that results in
the observation of $n \ge n_{obs}$ with a probability $\alpha$:

$$\alpha = \sum_{n=n_{obs}}^{\infty} \frac{(\lambda_{lo} + \lambda_B)^n}{n!}\, e^{-(\lambda_{lo} + \lambda_B)} = 1 - \sum_{n=0}^{n_{obs}-1} \frac{(\lambda_{lo} + \lambda_B)^n}{n!}\, e^{-(\lambda_{lo} + \lambda_B)}. \tag{5.19}$$


The mean $\lambda_{lo}$ corresponds to the situation shown in the left panel of Fig. 5.2:
assuming that the actual mean is as low as $\lambda_{lo}$, there is only a small probability
$\alpha$ (say 5 %) to make a measurement above or equal to what was actually measured.
Thus, we can say that there is only a very small chance ($\alpha$) that the actual mean
could have been as low as (or lower than) $\lambda_{lo}$. The quantity $\lambda_{lo}$ is the lower limit with
confidence $(1 - \alpha)$, i.e., we are $(1 - \alpha)$, say 95 %, confident that the mean is higher
than this value.

Fig. 5.2 This illustration of the upper and lower limits to the measurement of a Poisson mean
assumes a measurement of $n_{obs} = 3$. On the left, the lower limit to the parent mean is such that
there is a probability of $\alpha$ to measure $n_{obs}$ or higher (hatched area); on the right, the upper limit
leaves a probability of $(1 - \beta)$ that a measurement is $n_{obs}$ or lower. Both panels have the random
variable X on the horizontal axis.

By the same logic we also define $\lambda_{up}$ as the parent value of the source mean that
results in the observation of $n \le n_{obs}$ with a probability $(1 - \beta)$, or

$$1 - \beta = \sum_{n=0}^{n_{obs}} \frac{(\lambda_{up} + \lambda_B)^n}{n!}\, e^{-(\lambda_{up} + \lambda_B)}. \tag{5.20}$$

This is illustrated in the right panel of Fig. 5.2, where the number $(1 - \beta)$ is intended
as a small number, of same magnitude as $\alpha$. Assuming that the mean is as high as
$\lambda_{up}$, there is a small probability of $1 - \beta$ to make a measurement equal or lower than
the actual measurement. Therefore we say that there is only a small probability that
the true mean could be as high or higher than $\lambda_{up}$. The number $\lambda_{up}$ is the upper limit
with confidence $\beta$, that is, we are $\beta$ (say 95 %) confident that the mean is lower than
this value.

If we combine the two limits, the probability that the true mean is above $\lambda_{up}$ or
below $\lambda_{lo}$ is just $(1 - \beta) + \alpha$, say 10 %, and therefore the interval $\lambda_{lo}$ to $\lambda_{up}$ includes
a probability of

$$P(\lambda_{lo} \le \lambda \le \lambda_{up}) = 1 - (1 - \beta) - \alpha = \beta - \alpha,$$

i.e., this is a $(\beta - \alpha)$, say 90 %, confidence interval.
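Equations (5.19) and (5.20) have no closed-form solution for the limits, but they can be solved numerically. The sketch below does so by bisection, using only the Python standard library. The function names, the bracketing strategy and the tolerance are choices made here for illustration; the returned source mean is simply the total mean minus the background and is not constrained to be positive.

```python
import math

def poisson_cdf(n, mean):
    """P(X <= n) for a Poisson variable with the given mean."""
    if n < 0:
        return 0.0
    term = math.exp(-mean)            # probability of zero counts
    total = term
    for k in range(1, n + 1):
        term *= mean / k              # P(k) = P(k-1) * mean / k
        total += term
    return total

def find_root(func, tol=1e-10):
    """Bisection for a monotonic function of the mean that changes sign on [0, inf)."""
    lo, hi = 0.0, 1.0
    sign_lo = func(lo) > 0
    while (func(hi) > 0) == sign_lo:  # expand the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (func(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson_lower_limit(n_obs, alpha, background=0.0):
    """Source mean solving (5.19), P(n >= n_obs) = alpha; needs n_obs >= 1."""
    if n_obs < 1:
        raise ValueError("the lower limit requires at least one observed count")
    total = find_root(lambda m: 1.0 - poisson_cdf(n_obs - 1, m) - alpha)
    return total - background

def poisson_upper_limit(n_obs, beta, background=0.0):
    """Source mean solving (5.20), P(n <= n_obs) = 1 - beta."""
    total = find_root(lambda m: poisson_cdf(n_obs, m) - (1.0 - beta))
    return total - background

lam_lo = poisson_lower_limit(3, alpha=0.05)
lam_up = poisson_upper_limit(3, beta=0.95)
print(f"90 % interval for n_obs = 3: {lam_lo:.2f} to {lam_up:.2f}")
```

For $n_{obs} = 0$ and no background, (5.20) reduces to $e^{-\lambda_{up}} = 1 - \beta$, which gives a quick analytic check of the numerical solution.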

The upper and lower limits defined by (5.19) and (5.20) can be approximated
analytically using a relationship that relates the Poisson sum with an analytic
distribution function:

$$\sum_{x=0}^{n_{obs}-1} \frac{e^{-\lambda}\lambda^x}{x!} = 1 - P_{\chi^2}(\chi^2, \nu) \tag{5.21}$$

where $P_{\chi^2}(\chi^2, \nu)$ is the cumulative distribution of the $\chi^2$ probability distribution
function defined in Sect. 7.2, with parameters $\chi^2 = 2\lambda$ and $\nu = 2 n_{obs}$,

$$P_{\chi^2}(\chi^2, \nu) = \int_{-\infty}^{\chi^2} f_{\chi^2}(x, \nu)\,dx.$$

The approximation is due to Gehrels [16], and makes use of mathematical relationships that can be found in the handbook of Abramowitz and Stegun [1]. The result is that the upper and lower limits can be simply approximated once we specify the number of counts $n_{obs}$ and the probability level of the upper or lower limit. The probability level is described by the number $S$, which is the equivalent number of Gaussian $\sigma$ that corresponds to the confidence level chosen (for example, an 84 % confidence level corresponds to $S = 1$; see Table 5.3):

$$\begin{cases} \lambda_{up} = n_{obs} + \dfrac{S^2 + 3}{4} + S \sqrt{n_{obs} + \dfrac{3}{4}} \\[2ex] \lambda_{lo} = n_{obs} \left(1 - \dfrac{1}{9 n_{obs}} - \dfrac{S}{3 \sqrt{n_{obs}}}\right)^3 \end{cases} \tag{5.22}$$


The  S  parameter  is  also  a  quantile  of  a  standard  Gaussian  distribution,  enclosing  a 
probability  as  illustrated  in  Table  5.3. 

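Proof Use of (5.21) into (5.19) and (5.20) gives a relationship between the function $P_{\chi^2}$ and the probability levels $\alpha$ and $\beta$,

$$\begin{cases} P_{\chi^2}(2\lambda_{lo}, 2 n_{obs}) = \alpha \\ P_{\chi^2}(2\lambda_{up}, 2 n_{obs} + 2) = \beta. \end{cases} \tag{5.23}$$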


We use the simplest approximation for the function $P_{\chi^2}$ described in [16], one that is guaranteed to give limits that are accurate within 10 % of the true values. The approximation makes use of the following definitions: for any probability $a \leq 1$, $y_a$ is the $a$-quantile of a standard normal distribution, or $G(y_a) = a$,

$$G(y_a) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y_a} e^{-t^2/2}\, dt.$$

If $P_{\chi^2}(\chi^2_a, \nu) = a$, then the simplest approximation between $\chi^2_a$ and $y_a$ given by Gehrels [16] is

$$\chi^2_a \simeq \frac{1}{2} \left(y_a + \sqrt{2\nu - 1}\right)^2. \tag{5.24}$$

Consider the upper limit in (5.23). We can solve for $\lambda_{up}$ by using (5.24) with $2\lambda_{up} = \chi^2_a$, $\nu = 2 n_{obs} + 2$ and $S = y_a$, since $y_a$ is the $a$-quantile of a standard normal distribution, thus equivalent to $S$. It follows that

$$2\lambda_{up} = \frac{1}{2} \left(S + \sqrt{4 n_{obs} + 3}\right)^2$$

and from this the top part of (5.22) after a simple algebraic manipulation.

A similar result applies for the lower limit. □

Table 5.3 Poisson parameters S and corresponding probabilities

Upper or lower limit   Range            Probability (%)   Poisson S parameter
90 % confidence        ≤ μ + 1.28σ      90                1.28
95 % confidence        ≤ μ + 1.65σ      95                1.65
99 % confidence        ≤ μ + 2.33σ      99                2.33
1-σ                    ≤ μ + σ          84.1              1.0
2-σ                    ≤ μ + 2σ         97.7              2.0
3-σ                    ≤ μ + 3σ         99.9              3.0

Equation (5.22) is tabulated in Tables A.5 and A.6 for several interesting values of $n_{obs}$ and $S$. A few cases of common use are also shown in Table 5.4.
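
The approximation (5.22) is straightforward to evaluate directly. The sketch below is one possible implementation; the function names `gehrels_upper` and `gehrels_lower` are illustrative, and the handling of $n_{obs} = 0$ for the lower limit (returning 0) is an assumption made here, since the formula is not defined in that case.

```python
import math

def gehrels_upper(n_obs, s):
    """Approximate upper limit to a Poisson mean, top part of (5.22)."""
    return n_obs + (s ** 2 + 3.0) / 4.0 + s * math.sqrt(n_obs + 0.75)

def gehrels_lower(n_obs, s):
    """Approximate lower limit to a Poisson mean, bottom part of (5.22)."""
    if n_obs == 0:
        return 0.0  # assumption: no lower limit is meaningful for a non-detection
    base = 1.0 - 1.0 / (9.0 * n_obs) - s / (3.0 * math.sqrt(n_obs))
    return n_obs * base ** 3

if __name__ == "__main__":
    for n_obs in (0, 1):
        print(n_obs, [round(gehrels_upper(n_obs, s), 2) for s in (1, 2, 3)])
    for n_obs in (9, 10):
        print(n_obs, [round(gehrels_lower(n_obs, s), 2) for s in (1, 2, 3)])
```

For small $n_{obs}$ and large $S$ the term in parentheses of the lower limit can become negative, in which case the cubed value is not a meaningful limit.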

Example 5.8 (Poisson Upper Limit to Non-detection with No Background) An interesting situation that can be solved analytically is that of a complete non-detection of a source, $n_{obs} = 0$. Naturally, it is not meaningful to look for a lower limit to the Poisson mean, but it is quite interesting to solve (5.20) in search of an upper limit with a given confidence $\beta$. In this case of $n_{obs} = 0$ the equation simplifies to

$$1 - \beta = e^{-(\lambda_{up} + \lambda_B)} \;\Rightarrow\; \lambda_{up} = -\lambda_B - \ln(1 - \beta).$$

For $\beta = 0.84$ and zero background ($\lambda_B = 0$) this corresponds to an upper limit of $\lambda_{up} = -\ln 0.16 = 1.83$. This example can also be used to test the accuracy of the approximation given by (5.22). Using $n_{obs} = 0$ and $S = 1$, we obtain

$$\lambda_{up} = \frac{1}{4} \left(1 + \sqrt{3}\right)^2 = 1.87$$

which is in fact just 2 % higher than the exact result. An example of upper limits in the presence of a non-zero background is presented in Problem 5.8. ❖


Table 5.4 Selected upper and lower limits for a Poisson variable using the Gehrels approximation (see Tables A.5 and A.6 for a complete list of values)

             Poisson parameter S or confidence level
n_obs        S = 1              S = 2              S = 3
             (1-σ, or 84.1 %)   (2-σ, or 97.7 %)   (3-σ, or 99.9 %)

Upper limit
0            1.87               3.48               5.60
1            3.32               5.40               7.97
...

Lower limit
...
9            6.06               4.04               2.52
10           6.90               4.71               3.04
...

5.5 Bayesian Methods for the Poisson Mean

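The Bayesian method consists of determining the posterior probability $P(\lambda/obs)$, having calculated the likelihood as

$$P(n_{obs}/\lambda) = \frac{(\lambda + \lambda_B)^{n_{obs}}}{n_{obs}!} e^{-(\lambda + \lambda_B)}. \tag{5.25}$$

We use Bayes' theorem,

$$P(\lambda/obs) = \frac{P(n_{obs}/\lambda)\, \pi(\lambda)}{\int_0^\infty P(n_{obs}/\lambda')\, \pi(\lambda')\, d\lambda'} \tag{5.26}$$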


in which we needed to introduce a prior probability distribution $\pi(\lambda)$ in order to calculate the posterior probability. The use of a prior distribution is what constitutes the Bayesian approach. The simplest assumption is that of a uniform prior, $\pi(\lambda) = C$, over an arbitrarily large range of $\lambda \geq 0$, but other choices are possible according to the information available on the Poisson mean prior to the measurements. In this section we derive the Bayesian expectation of the Poisson mean and upper and lower limits and describe the differences with the classical method.


5.5.1  Bayesian  Expectation  of  the  Poisson  Mean 

The posterior distribution of the Poisson mean $\lambda$ (5.26) can be used to calculate the Bayesian expectation for the mean, as the integral of (5.26) over the entire range allowed to the mean,

$$E[\lambda/obs] = \frac{\int_0^\infty P(\lambda/obs)\, \lambda\, d\lambda}{\int_0^\infty P(\lambda/obs)\, d\lambda}. \tag{5.27}$$

The answer will in general depend on the choice of the prior distribution $\pi(\lambda)$. Assuming a constant prior (and no background, $\lambda_B = 0$), the expectation becomes

$$E[\lambda/obs] = \int_0^\infty \frac{e^{-\lambda} \lambda^{n_{obs}+1}}{n_{obs}!}\, d\lambda = n_{obs} + 1, \tag{5.28}$$

where we made use of the integral

$$\int_0^\infty e^{-\lambda} \lambda^n\, d\lambda = n!$$

The interesting result is therefore that a measurement of $n_{obs}$ counts implies a Bayesian expectation of $E[\lambda] = n_{obs} + 1$, i.e., one count more than the observation. Therefore even a non-detection results in an expectation for the mean of the parent distribution of 1, and not 0. This somewhat surprising result can be understood by considering the fact that even a parent mean of 1 results in a likelihood of $1/e$ (i.e., a relatively large number) of obtaining zero counts as a result of random fluctuations. Moreover, the Poisson distribution is skewed, with a heavier tail at large values of $\lambda$. This calculation is due to Emslie [12].
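
The result $E[\lambda] = n_{obs} + 1$ can be checked numerically. The sketch below integrates the posterior for a uniform prior with the trapezoidal rule; the function name `posterior_mean`, the integration cutoff, and the number of steps are assumptions made for illustration only.

```python
import math

def posterior_mean(n_obs, lam_b=0.0, steps=20000):
    """Bayesian expectation (5.27) of the source mean for a uniform prior.

    The posterior is proportional to the likelihood (5.25); the constant
    1/n_obs! cancels in the ratio and is omitted.
    """
    upper = n_obs + lam_b + 60.0     # assumed cutoff; the integrand is negligible beyond it
    h = upper / steps
    num = 0.0
    den = 0.0
    for i in range(steps + 1):
        lam = i * h
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoidal end-point weights
        like = (lam + lam_b) ** n_obs * math.exp(-(lam + lam_b))
        num += weight * lam * like
        den += weight * like
    return num / den

if __name__ == "__main__":
    for n_obs in (0, 1, 3, 10):
        print(n_obs, round(posterior_mean(n_obs), 4))
```

Passing a non-zero `lam_b` shows how the expectation of the source mean changes when a background is present, a case in which the simple result $n_{obs} + 1$ no longer applies in general.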


5.5.2  Bayesian  Upper  and  Lower  Limits  for  a  Poisson  Variable 


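Using a uniform prior, we use the Bayesian approach (5.26) to calculate the upper limit to the source mean with confidence $\beta$ (say, 95 %). This is obtained by integrating (5.26) from the lower limit of 0 to the upper limit $\lambda_{up}$,

$$\beta = \frac{\int_0^{\lambda_{up}} P(n_{obs}/\lambda)\, d\lambda}{\int_0^\infty P(n_{obs}/\lambda)\, d\lambda} = \frac{\int_0^{\lambda_{up}} (\lambda + \lambda_B)^{n_{obs}} e^{-(\lambda + \lambda_B)}\, d\lambda}{\int_0^\infty (\lambda + \lambda_B)^{n_{obs}} e^{-(\lambda + \lambda_B)}\, d\lambda}. \tag{5.29}$$

Similarly, the lower limit can be estimated according to

$$\alpha = \frac{\int_0^{\lambda_{lo}} P(n_{obs}/\lambda)\, d\lambda}{\int_0^\infty P(n_{obs}/\lambda)\, d\lambda} = \frac{\int_0^{\lambda_{lo}} (\lambda + \lambda_B)^{n_{obs}} e^{-(\lambda + \lambda_B)}\, d\lambda}{\int_0^\infty (\lambda + \lambda_B)^{n_{obs}} e^{-(\lambda + \lambda_B)}\, d\lambda} \tag{5.30}$$

where $\alpha$ is a small probability, say 5 %. Since $n_{obs}$ is always an integer, these integrals can be evaluated analytically.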

The difference between the classical limits described by (5.19) and (5.20) and the Bayesian limits of (5.29) and (5.30) is summarized by the different variable of integration (or summation) in the relevant equations. For the classical limits we use the Poisson probability to make $n_{obs}$ measurements for a true mean of $\lambda$. We then estimate the upper or lower limits as the values of the mean that give a probability of, respectively, $1 - \beta$ to observe $n \leq n_{obs}$ events and $\alpha$ to observe $n \geq n_{obs}$ events. In this case, the probability is evaluated as a sum over the number of counts, for a fixed value of the parent mean.

In the case of the Bayesian limits, on the other hand, we first calculate the posterior distribution of $\lambda$, and then require that the range between 0 and the limits $\lambda_{up}$ and $\lambda_{lo}$ includes, respectively, a probability $\beta$ and $\alpha$, evaluated as an integral over the mean for a fixed value of the detected counts. In general, the two methods will give different results.

Example 5.9 (Bayesian Upper Limit to a Non-detection) The case of non-detection, $n_{obs} = 0$, is especially simple and interesting, since the background drops out of the equation, resulting in $\beta = 1 - e^{-\lambda_{up}}$, which gives

$$\lambda_{up} = -\ln(1 - \beta) \quad \text{(case of } n_{obs} = 0 \text{, Bayesian upper limit)}. \tag{5.31}$$

The Bayesian upper limit is therefore equal to the classical limit, when there is no background. When there is background, the two estimates will differ. ❖
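
For integer $n_{obs}$ the integrals in (5.29) and (5.30) reduce to Poisson sums, since $\int_x^\infty t^n e^{-t}\, dt = n!\, e^{-x} \sum_{k=0}^{n} x^k/k!$. Writing $F(n, t)$ for the Poisson cumulative sum up to $n$ with mean $t$, condition (5.29) becomes $F(n_{obs}, \lambda_{up} + \lambda_B) = (1 - \beta)\, F(n_{obs}, \lambda_B)$. The sketch below solves this by bisection; the function names are illustrative and not from the text.

```python
import math

def poisson_cdf(n, mean):
    """Return P(N <= n) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, n + 1):
        term *= mean / k
        total += term
    return total

def solve_total_mean(n, target, lo, hi):
    """Bisection for t in [lo, hi] with poisson_cdf(n, t) = target (decreasing in t)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bayesian_limit(n_obs, prob, lam_b=0.0):
    """Source mean below which the posterior (uniform prior) contains `prob`.

    Use prob = beta for the upper limit (5.29) and prob = alpha for the
    lower limit (5.30).
    """
    target = (1.0 - prob) * poisson_cdf(n_obs, lam_b)
    hi = lam_b + n_obs + 10.0 * math.sqrt(n_obs + 1.0) + 50.0
    return solve_total_mean(n_obs, target, lam_b, hi) - lam_b

if __name__ == "__main__":
    # Non-detection: the background drops out, as in (5.31)
    print(bayesian_limit(0, 0.95), bayesian_limit(0, 0.95, lam_b=3.0))
    # A detection of 3 counts with and without background
    print(bayesian_limit(3, 0.95), bayesian_limit(3, 0.95, lam_b=2.0))
```

Because the bisection starts at a total mean equal to $\lambda_B$, the returned source limit is never negative, which reflects the restriction of the prior to $\lambda \geq 0$.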


Summary  of  Key  Concepts  for  this  Chapter 

□  Maximum  Likelihood  (ML)  method :  A  method  to  estimate  parameters  of  a 
distribution  under  the  assumption  that  the  best-fit  parameters  maximize  the 
likelihood  of  the  measurements. 

□ ML estimates of mean and variance: For a Gaussian variable, the unbiased ML estimates are

$$\mu_{ML} = \bar{x}, \qquad s^2 = \frac{1}{N - 1} \sum (x_i - \bar{x})^2$$

□ Estimates of mean with non-uniform uncertainties: They are given by

$$\mu_{ML} = \frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}, \qquad \sigma_\mu^2 = \frac{1}{\sum 1/\sigma_i^2}$$

□ Confidence intervals: Range of the variable that contains a given probability of occurrence (e.g., the ±1σ range contains 68 % of probability for a Gaussian variable).

□  Upper  and  lower  limits :  An  upper  (lower)  limit  is  the  value  below  (above) 
which  there  is  a  given  probability  (e.g.,  90  %)  to  observe  the  variable. 


Problems 

5.1  Using  the  definition  of  weighted  sample  mean  as  in  (5.8),  derive  its  variance 
and  show  that  it  is  given  by  (5.9). 

5.2  Using  the  data  from  Mendel’s  experiment  (Table  1.1),  calculate  the  standard 
deviation  in  the  measurement  of  each  of  the  seven  fractions  of  dominants,  and  the 
weighted  mean  and  standard  deviation  of  the  seven  fractions. 

Compare  your  result  from  a  direct  calculation  of  the  overall  fraction  of  domi¬ 
nants,  obtained  by  grouping  all  dominants  from  the  seven  experiments  together. 


Additional  considerations  on  the  measurements  of  the  mean  of  a  Poisson  variable,  and  the  case  of 
upper  and  lower  limits,  can  be  found  in  [10]. 

5.3 The Mendel experiment of Table 1.1 can be described as a number $n$ of measurements of $n_i$, the number of plants that display the dominant character, out of a total of $N_i$ plants. The experiment is described by a binomial distribution with probability $p = 0.75$ for the plant to display the dominant character.

Using the properties of the binomial distribution, show analytically that the weighted average of the measurements of the fraction $f_i = n_i/N_i$ is equal to the value calculated directly as

$$\frac{\sum_{i=1}^{n} n_i}{\sum_{i=1}^{n} N_i}.$$


5.4 Consider a decaying radioactive source observed in a time interval of duration $T = 15$ s; $N$ is the number of total counts, and $B$ is the number of background counts (assumed to be measured independently of the total counts):

$$\begin{cases} N = 19 \text{ counts} \\ B = 14 \text{ counts} \end{cases}$$

The goal is to determine the probability of detection of source counts $S = N - B$ in the time interval $T$.

(a) Calculate this probability directly via:

Prob(detection) = Prob($S > 0$/data)

in which $S$ is treated as a random variable, with Gaussian distribution of mean and variance calculated according to the error propagation formulas. Justify why the Gaussian approximation may be appropriate for the variable $S$.

(b) Use the same method as in (a), but assuming that the background $B$ is known without error (e.g., as if it was observed for such a long time interval that its error becomes negligible).

(c) Assume that the background is a variable with mean of 14 counts in a 15 s interval, and that it can be observed for an interval of time $T \gg 15$ s. Find what interval of time $T$ makes the error $\sigma_{B_{15}}$ of the background over a time interval of 15 s have a value $\sigma_{B_{15}}/B_{15} = 0.01$, i.e., negligible.

5.5  For  the  Thomson  experiment  of  Table  2.1  (tube  1)  and  Table  2.2  (tube  2), 
calculate: 

(a)  The  90  %  central  confidence  intervals  for  the  variable  v ; 

(b)  The  90  %  upper  and  lower  limits,  assuming  that  the  variable  is  Gaussian. 

5.6 Consider a Poisson variable $X$ of mean $\mu$.

(a) We want to set 90 % confidence upper limits to the value of the parent mean $\lambda$, assuming that one measurement of the variable yielded the result of $N = 1$. Following the classical approach, find the equation that determines the exact 90 % upper limit to the mean $\lambda_{up}$. Recall that the classical 90 % confidence upper limit is defined as the value of the Poisson mean that yields a $P(X \leq N) = \beta$, where $1 - \beta = 0.9$.

(b) Using the Bayesian approach, which consists of defining the $1 - \beta = 0.9$ upper limit via

$$\beta = \frac{\int_{\lambda_{up}}^\infty P(n_{obs}/\lambda)\, d\lambda}{\int_0^\infty P(n_{obs}/\lambda)\, d\lambda} \tag{5.32}$$

where $n_{obs} = N$; find the equation that determines the 90 % upper limit to the mean $\lambda_{up}$.

5.7 The data provided in Table 2.3 from Pearson's experiment on biometric data describes the cumulative distribution function of heights from a sample of 1,079 couples. Calculate the 2σ upper limit to the fraction of couples in which both mother and father are taller than 68 in.

5.8 Use the data presented in Example 5.7, in which there is a non-detection of a source in the presence of a background of $\lambda_B \simeq 9.8$. Determine the Poisson upper limit to the source count at the 99 % confidence level and compare this upper limit with that obtained in the case of a zero background level.


Chapter  6 

Mean,  Median,  and  Average  Values  of  Variables 


Abstract  The  data  analyst  often  faces  the  question  of  what  is  the  “best”  value  to 
report  from  N  measurements  of  a  random  variable.  In  this  chapter  we  investigate 
the  use  of  the  linear  average,  the  weighted  average,  the  median  and  a  logarithmic 
average  that  may  be  applicable  when  the  variable  has  a  log-normal  distribution. 
The  latter  may  be  useful  when  a  variable  has  errors  that  are  proportional  to  their 
measurements,  avoiding  the  inherent  bias  arising  in  the  weighted  average  from 
measurements  with  small  values  and  small  errors.  We  also  introduce  a  relative-error 
weighted  average  that  can  be  used  as  an  approximation  for  the  logarithmic  mean  for 
log-normal  distributions. 


6.1  Linear  and  Weighted  Average 

In  the  previous  chapter  (see  Sect.  5.1.3)  we  have  shown  that  the  weighted  mean  is 
the  most  likely  value  of  the  mean  of  the  random  variable.  Therefore,  the  weighted 
mean  is  a  commonly  accepted  quantity  to  report  as  the  best  estimate  for  the  value 
of  a  measured  quantity.  If  the  measurements  have  the  same  standard  deviation,  then 
the  weighted  mean  becomes  the  linear  average;  in  general,  the  linear  and  weighted 
means  differ  unless  all  measurement  errors  are  identical. 

The difference between linear average and weighted mean can be illustrated with an example. Consider the $N = 25$ measurements shown in Table 6.1, which reports the measurement of the energy of certain astronomical sources made at a given radius [5]. This dataset is illustrative of the general situation of the measurement of a quantity (in this example, the ratio between the two measurements) in the presence of different measurement errors. The weighted mean
is  0.90  ±  0.02,  while  the  linear  average  is  1.01  (see  Problem  6.1).  The  difference  is 
clearly  due  to  the  presence  of  a  few  measurements  with  a  low  value  of  the  ratio  that 
carry  higher  weight  because  of  the  small  measurement  error  (for  example,  source 
15). 

Which  of  the  two  values  is  more  representative?  This  question  can  be  addressed 
by  making  the  following  observations.  The  measurement  error  reported  in  the  table 
reflects  the  presence  of  such  sources  of  uncertainty  as  Poisson  fluctuations  in  the 
detection  of  photons  from  the  celestial  sources.  The  same  type  of  uncertainty  would 
also  apply  to  other  experiments,  in  particular  those  based  on  the  counting  of  events. 

Table 6.1 Dataset with measurement of energy for N = 25 different sources and their ratio


Energy 

Source 

Radius 

Method  #1 

Method  #2 

Ratio 

1 

221.1  ±[23 

8.30±o;gg 

9.67 ± !  }| 

0.86±q;®7 

2 

268. 5 ±207 

4-92 

4.i9±g:“ 

1.17±8;i| 

3 

138.4±if.9 

3.03±®;« 

2.61  ±85? 

1.16±8;2« 

4 

714.3±345 

49.61  ±|!| 

60.62±gi3 

0.82±8;“ 

5 

182.3±[f;f 

2  75+0-49 
'  J=I-0.43 

3.30±8;IJ 

0.83±8i4 

6 

72.1  ±1? 

1-01  ±£20 

o.86±8;!| 

1  17±0-24 

A*A  /=c0.21 

7 

120.3±75 

5.04±°;« 

3.80±«;22 

i.33±8;!| 

8 

196.2±£> 

5.18±a7o 

6.00±  j;j] 

0.86±8;‘t 

9 

265.7 ±U 

12.17±|;i7 

io.56±8;|1 

1  14-1-0.13 

1-14=co.io 

10 

200.0±|q67 

7.74±®j2 

6.26±8;I! 

1  24±0-14 

11 

78.8±j  | 

l-08±a!s 

0-73±8;}o 

1  49+0-26 
i.HVZHo  24 

12 

454.4±203 

17.10±203 

23.12±|;lf 

0  7Si0-07 
u<  /7)=i=0.06 

13 

109.4±|;| 

9  7 1  — 0.34 

9  a/:_l0.54 
j  .uo n0  52 

1  09±°-18 

i.uvn=o  15 

14 

156.5±|>j 

2.36±»;« 

2-31  ±031 

1  02±°-26 
i.uzzCq  23 

15 

218.0±|f 

i4.02±8:?s 

21.59±1|| 

0  65±0-04 
u.o7>=r004 

16 

370.7±™ 

3 1  -4i  ±l;ll 

29.67±};|f 

1.06=1=8:88 

17 

189.1  ±!|;1 

9  1  5-1-0.45 

2  52+0-57 

Z.JZZCo  51 

0-86±8;fi 

18 

150.5±];| 

9  90-1-0.57 

50 

4-7  5±°H 

0-72±8;jJ 

19 

326.7±‘29' 

15.73±|;« 

18-03  ±\f6 

O 

bo 

H- 

0  0 

0  0 

o\  crs 

20 

1 89. 1  ±|;? 

5.04±»;| 

4-61  ±0:50 

i.09±8;!I 

21 

147.7±ii0j 

2.53±»;209 

2.7  6±«;39 

o.93±8:!o 

22 

504.6±!2| 

44.97±2;?9 

43.93±2  ®! 

1 .02±ao5 

23 

170.5±|;f 

3.89±°i°9 

3-93±®;99 

0.98±«:>°9 

24 

297.6±}|;i 

10.78±J;q2 

10.48±|;22 

i-04±8:i? 

25 

256.2±  4 

7.27±°;*‘ 

7.37±®;99 

O 

H- 

0  0 
0  0 

VO  VO 

This  type  of  uncertainty  is  usually  referred  to  as  statistical  error.  Many  experiments 
and  measurements  are  also  subject  to  other  sources  of  uncertainty  that  may  not  be 
explicitly  reported  in  the  dataset.  For  example,  the  measurement  of  events  recorded 
by  a  detector  is  affected  by  the  calibration  of  the  detector,  and  a  systematic  offset 
in  the  calibration  would  affect  the  numbers  recorded.  In  the  case  of  the  data  of 
Table 6.1, the uncertainty due to the calibration of the detector is likely to affect all measurements by the same amount, regardless of the precision indicated by the statistical error. This type of uncertainty is typically referred to as systematic error, and the inclusion of such an additional source of uncertainty would modify the value of the weighted mean. As an example of this effect, if we add an error of ±0.1 to all values of the ratio of Table 6.1, the weighted mean becomes 0.95 ± 0.04 (see Problem 6.2). It is clear that the addition of a constant error for each measurement causes a de-weighting of datapoints with small statistical errors, and in the limit of a large systematic error the weighted mean becomes the linear average. Therefore, the linear average can be used when the data analyst wants to weigh equally all
datapoints,  regardless  of  the  precision  indicated  by  the  statistical  errors.  Systematic 
errors  are  discussed  in  more  detail  in  Chap.  11. 


6.2  The  Median 

Another  quantity  that  can  be  calculated  from  the  N  measurements  is  the  median, 
defined  in  Sect.  2.3.1  as  the  value  of  the  variable  that  is  greater  than  50%  of 
the  measurements,  and  also  lower  than  50  %  of  the  measurements.  In  the  case  of 
the  measurement  of  the  ratios  in  Table  6.1,  this  is  simply  obtained  by  ordering 
the  25  measurements  in  ascending  order,  and  using  the  13th  measurement  as  an 
approximation for the median. The value obtained in this case is 1.02, quite close to the value of the linear average, since neither statistic takes into account the measurement errors.

One  useful  feature  of  the  median  is  that  it  is  not  very  sensitive  to  “outliers”  in  the 
distribution. For example, if one of the measurements was erroneously reported as 0.07 ± 0.01 (instead of 0.72 ± 0.11, such as source 18 in the table), both linear and weighted averages would be affected by the error, but the median would not. The
median  may  therefore  be  an  appropriate  value  to  report  in  cases  where  the  analyst 
suspects  the  presence  of  outliers  in  the  dataset. 
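
The effect of a single mis-recorded value on the three statistics can be seen with a short calculation. The sketch below uses a small set of hypothetical measurements and errors (invented for illustration, not the values of Table 6.1), in which one entry of 0.72 ± 0.11 is replaced by 0.07 ± 0.01; the helper `weighted_mean` is likewise an illustrative name.

```python
import statistics

def weighted_mean(values, errors):
    """Weighted mean with weights 1/sigma_i**2."""
    weights = [1.0 / e ** 2 for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical measurements and errors, for illustration only
values = [0.72, 0.86, 0.98, 1.02, 1.09, 1.17, 1.33]
errors = [0.11, 0.07, 0.09, 0.05, 0.10, 0.12, 0.15]

# Same data with the first measurement erroneously recorded as 0.07 +/- 0.01
bad_values = [0.07] + values[1:]
bad_errors = [0.01] + errors[1:]

if __name__ == "__main__":
    for label, v, e in (("correct", values, errors),
                        ("with outlier", bad_values, bad_errors)):
        print(label,
              "linear:", round(statistics.mean(v), 3),
              "weighted:", round(weighted_mean(v, e), 3),
              "median:", statistics.median(v))
```

In this illustration the median is unchanged, the linear average shifts moderately, and the weighted mean is dominated by the outlier because of its small quoted error.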


6.3  The  Logarithmic  Average  and  Fractional 
or  Multiplicative  Errors 

The quantity “Ratio” in Table 6.1 can be used to illustrate a type of variable that may require special attention when calculating its average. Consider a variable whose errors are proportional to their measured values. In this case, a weighted average will be skewed towards lower values because of the smaller errors in those measurements. The question we want to address is whether it is appropriate to use a weighted average of these measurements or whether one should use a different approach.

To illustrate this situation, let's use two measurements such as $x_1 = 1.2 \pm 0.24$ and $x_2 = 0.80 \pm 0.16$. Both measurements have a relative error of 20 %, the linear average is 1.00 and the weighted average is 0.923. The base-10 logarithms of these measurements are $\log x_1 = 0.0792$ and $\log x_2 = -0.0969$, with the same error. In fact, using the error propagation method (Sect. 4.7.6), the error in the logarithm is proportional to the fractional error according to

$$\sigma_{\log x} = \frac{\sigma_x}{x} \frac{1}{\ln 10}. \tag{6.1}$$




For our measurements, this equation gives a value of $\sigma_{\log x} = 0.087$ for both measurements. The weighted average of these logarithms is therefore the linear average $\overline{\log x} = -0.0088$, leading to an average of $x = 0.980$. This value is much closer to the linear average of 1.00 than to the weighted average.
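
The numbers of this two-measurement example can be reproduced with a few lines of code. This is a minimal sketch; the helper name `weighted_mean` is an illustrative choice.

```python
import math

def weighted_mean(values, errors):
    """Weighted mean with weights 1/sigma_i**2."""
    weights = [1.0 / e ** 2 for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

x = [1.2, 0.80]
sigma = [0.24, 0.16]          # both measurements have a 20 % relative error

linear = sum(x) / len(x)
weighted = weighted_mean(x, sigma)

# Base-10 logarithms and their errors from (6.1)
logs = [math.log10(v) for v in x]
log_errors = [s / (v * math.log(10.0)) for v, s in zip(x, sigma)]
log_average = weighted_mean(logs, log_errors)
x_from_log = 10.0 ** log_average

if __name__ == "__main__":
    print("linear average:      ", round(linear, 3))
    print("weighted average:    ", round(weighted, 3))
    print("errors of logarithms:", [round(e, 3) for e in log_errors])
    print("average of logs:     ", round(log_average, 4))
    print("back to linear scale:", round(x_from_log, 3))
```

Since the two logarithms have the same error, their weighted average coincides with their plain average, and the value on the linear scale is the geometric mean of the two measurements.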

Errors that are exactly proportional to the measurement, or

$$\sigma_x = x\, \sigma_r \tag{6.2}$$

may be called fractional or multiplicative errors. The quantity $\sigma_r$ is the relative error and it remains constant for purely multiplicative errors. In most cases, including that of Table 6.1, the relative error $\sigma_x/x$ varies among the measurements, and therefore (6.2) applies only as an approximation. In the following we investigate when it is in fact advisable to use the logarithm of measurements, instead of the measurements themselves, to obtain a more accurate determination of the mean of a variable that has multiplicative errors.


6.3.1  The  Weighted  Logarithmic  Average 


The maximum likelihood method applied to the logarithm of measurements of a variable $X$ can be used to estimate the mean and the error of $\log X$. The weighted logarithmic average of $N$ measurements $x_i$ is defined as

$$\overline{\log x} = \frac{\sum_{i=1}^N \dfrac{\log x_i}{\sigma^2_{\log x_i}}}{\sum_{i=1}^N \dfrac{1}{\sigma^2_{\log x_i}}} \tag{6.3}$$

where $\sigma^2_{\log x_i}$ is the variance of the logarithm of the measurements, which can be obtained from (6.1). The uncertainty in the weighted logarithmic average is given by

$$\sigma^2_{\overline{\log x}} = \frac{1}{\sum_{i=1}^N \dfrac{1}{\sigma^2_{\log x_i}}}. \tag{6.4}$$


The use of this logarithmic average is justified when the variable $X$ has a log-normal distribution, i.e., when $\log X$ has a Gaussian distribution, rather than the variable $X$ itself. An example of a log-normal variable is illustrated in Fig. 6.1. In this case, the maximum likelihood estimator of the mean of $\log X$ is the logarithmic mean of (6.3). Clearly, a variable can only be log-normal when the variable has positive values, such as the ratio of two positive quantities. The determination of the log-normal shape can be made if one has available random samples from its distribution.

Fig. 6.1 Log-normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 0.3$ (black line, as a function of the variable $\log X$) and linear plot of the same distribution (red line, as a function of the variable $X$). A heavier right-hand tail in the linear plot may be indicative of a log-normal distribution

In the limit of measurements with the same fractional error and small deviations from the mean $\mu$, the weighted logarithmic average is equivalent to the linear average.

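Proof This can be shown by proving that

$$\overline{\log x} = \log \bar{x}$$

where $\bar{x}$ is the ordinary linear average. Notice that $\log x$ in (6.3) is a base-10 logarithm. In this proof we make use of the base-$e$ logarithm ($\ln x$); the two are related by

$$\log x = \ln x / \ln 10.$$

Consider $N$ measurements $x_i$ in the neighborhood of the mean $\mu$ of the random variable, $x_i = \mu + \Delta x_i$. A Taylor series expansion yields

$$\ln x_i = \ln \left[\mu \left(1 + \frac{\Delta x_i}{\mu}\right)\right] = \ln \mu + \frac{\Delta x_i}{\mu} - \frac{(\Delta x_i/\mu)^2}{2} + \ldots$$

If the deviation is $\Delta x_i \ll \mu$, one can neglect terms of the second order and higher. The average of the logarithms of the $N$ measurements can thus be approximated as

$$\overline{\log x} = \frac{1}{N} \sum_{i=1}^N \log x_i \simeq \log \mu + \frac{1}{N \ln 10} \sum_{i=1}^N \frac{\Delta x_i}{\mu}.$$

On the other hand, the logarithm of the mean $\bar{x}$ is

$$\log \bar{x} = \log \left(\frac{1}{N} \sum_{i=1}^N x_i\right) = \log \left(\frac{\mu}{N} \sum_{i=1}^N \left(1 + \frac{\Delta x_i}{\mu}\right)\right).$$

This leads to

$$\log \bar{x} = \log \mu + \log \left(1 + \frac{1}{N} \sum_{i=1}^N \frac{\Delta x_i}{\mu}\right) \simeq \log \mu + \frac{1}{N \ln 10} \sum_{i=1}^N \frac{\Delta x_i}{\mu}$$

where we retained only the first-order term in the Taylor series expansion of the logarithm since $\Delta x_i/\mu \ll 1$. □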

As  discussed  earlier  in  this  section,  the  logarithmic  average  is  an  appropriate 
quantity  for  log-normal  distributed  variables.  The  results  of  this  section  show 
that  this  average  is  closer  to  the  linear  average  of  the  measurements  than  the 
standard  weighted  average,  when  measurement  errors  are  positively  correlated  to 
the  measurements  themselves. 


Example 6.1 The data of Table 6.1 can be used to calculate the logarithmic average of the column “Ratio” according to (6.3) and (6.4) as $\overline{\log x} = -0.023 \pm 0.018$. These quantities can be converted easily to linear quantities taking into account the error propagation formula $\sigma_{\log x} = \sigma/(x \ln 10)$, to obtain a value of 0.95 ± 0.04.

Notice how the logarithmic mean has a value that is somewhere between that of the linear average $\bar{x} = 1.01$ and the traditional weighted average of 0.90 ± 0.02. It should not be surprising that the logarithmic mean is not exactly equal to the linear average. In fact, the measurements of Table 6.1 have different relative errors. Only in the case of identical relative errors for all measurements do we expect the two averages to have the same value. ❖
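
The conversion from the logarithmic average to a linear value follows from inverting the error propagation formula. A minimal sketch, with `log_to_linear` as an illustrative function name:

```python
import math

def log_to_linear(log_mean, log_error):
    """Convert a base-10 logarithmic average and its error to linear quantities.

    The error follows from sigma_logx = sigma / (x ln 10), solved for sigma.
    """
    value = 10.0 ** log_mean
    error = log_error * value * math.log(10.0)
    return value, error

if __name__ == "__main__":
    value, error = log_to_linear(-0.023, 0.018)
    print(round(value, 2), round(error, 2))
```

This first-order conversion gives a symmetric error on the linear scale; for larger logarithmic errors the corresponding interval would be noticeably asymmetric.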

6.3.2 The Relative-Error Weighted Average


Although  transforming  measurements  to  their  logarithms  is  a  simple  procedure, 
we  also  want  to  investigate  another  type  of  average  that  deals  directly  with  the 
measurements  without  the  need  to  calculate  their  logarithms. 

We introduce the relative-error weighted average as

$$\bar{x}_{RE} = \frac{\sum_{i=1}^N x_i/(\sigma_i/x_i)^2}{\sum_{i=1}^N 1/(\sigma_i/x_i)^2}. \tag{6.5}$$

The only difference with the weighted mean defined in Sect. 5.1.3 is the use of the extra factor of $x_i$ in the error term, so that $\sigma_i/x_i$ is the relative error of each measurement.

The reason to introduce this new average is that, for log-normal variables, this relative-error weighted mean is equivalent to the logarithmic mean of (6.3). This can be proven by showing that $\overline{\ln x} = \ln \bar{x}_{RE}$.

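Proof Start with the logarithm of the relative-error weighted average,

$$\ln \bar{x}_{RE} = \ln \left(\frac{\sum_{i=1}^N x_i/(\sigma_i/x_i)^2}{\sum_{i=1}^N 1/(\sigma_i/x_i)^2}\right) = \ln \left(\frac{\sum_{i=1}^N x_i/\sigma^2_{\log x_i}}{\sum_{i=1}^N 1/\sigma^2_{\log x_i}}\right).$$

From this, expand the measurement term $x_i = \mu + \Delta x_i$, where $\mu$ is the parent mean of the variable $X$,

$$\ln \bar{x}_{RE} = \ln \left(\mu + \frac{\sum_{i=1}^N \Delta x_i/\sigma^2_{\log x_i}}{\sum_{i=1}^N 1/\sigma^2_{\log x_i}}\right) = \ln \mu + \ln \left(1 + \frac{\sum_{i=1}^N \Delta x_i/(\mu\, \sigma^2_{\log x_i})}{\sum_{i=1}^N 1/\sigma^2_{\log x_i}}\right).$$

If $\Delta x_i \ll \mu$, then

$$\ln \bar{x}_{RE} = \ln \mu + \frac{\sum_{i=1}^N \Delta x_i/(\mu\, \sigma^2_{\log x_i})}{\sum_{i=1}^N 1/\sigma^2_{\log x_i}}$$

leading to

$$\log \bar{x}_{RE} = \log \mu + \frac{1}{\ln 10} \frac{\sum_{i=1}^N \Delta x_i/(\mu\, \sigma^2_{\log x_i})}{\sum_{i=1}^N 1/\sigma^2_{\log x_i}}.$$

The logarithmic average can also be expanded making use of

$$\sum_{i=1}^N \frac{\log x_i}{\sigma^2_{\log x_i}} = \sum_{i=1}^N \frac{\log \mu + \log(1 + \Delta x_i/\mu)}{\sigma^2_{\log x_i}}.$$

This leads to

$$\overline{\log x} = \log \mu + \frac{1}{\ln 10} \frac{\sum_{i=1}^N \Delta x_i/(\mu\, \sigma^2_{\log x_i})}{\sum_{i=1}^N 1/\sigma^2_{\log x_i}}.$$

□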

The use of the relative-error weighted average should be viewed as an ad hoc method to obtain an average value that is consistent with the logarithmic average, especially in the limit of measurements with equal relative errors. The statistical uncertainty in this relative-error weighted average can be simply assigned as the error in the traditional weighted average (5.8). In fact, the statistical error should be determined by the “physical” uncertainties in the measurements, as is the case for the variance in (5.8). It would be tempting to use the inverse of the denominator of (6.5) as the variance; however, the result would be biased by our somewhat arbitrary choice of weighing the measurements by the relative errors, instead of the errors themselves.
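
The relative-error weighted average (6.5) and the uncertainty assigned to it can be sketched as follows. The function names are illustrative; the data are the two measurements $1.2 \pm 0.24$ and $0.80 \pm 0.16$ used earlier in Sect. 6.3, and the code assumes strictly positive measurements.

```python
import math

def relative_error_weighted_average(values, errors):
    """Average (6.5): weights are the inverse squares of the relative errors."""
    weights = [1.0 / (s / v) ** 2 for v, s in zip(values, errors)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def weighted_mean_error(errors):
    """Error of the traditional weighted mean, 1/sqrt(sum of 1/sigma_i**2)."""
    return 1.0 / math.sqrt(sum(1.0 / s ** 2 for s in errors))

if __name__ == "__main__":
    x = [1.2, 0.80]
    sigma = [0.24, 0.16]
    print(round(relative_error_weighted_average(x, sigma), 3),
          round(weighted_mean_error(sigma), 3))
```

With equal relative errors all weights are identical, so the relative-error weighted average reduces to the linear average, while the quoted uncertainty is still set by the absolute errors of the measurements.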

Example 6.2 Continuing with the values of “Ratio” in Table 6.1, the error-weighted average is calculated as $\bar{x}_{RE} = 0.96$. The error in the traditional weighted average was 0.02, therefore we may report the result as 0.96 ± 0.02. Comparison with the value of 0.95 ± 0.04 for the logarithmic average shows the general agreement between these two values.

❖ 
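The comparison between the two averages can be tried numerically. The following is a minimal sketch, not the book's own code: the function names are illustrative, and the four sample values are made up for the illustration (they are not the entries of Table 6.1). The logarithm is taken in base 10, with $\sigma_{\log x_i} = \sigma_i/(x_i \ln 10)$.

```python
import math

def relative_error_weighted_average(x, sigma):
    """Average of x weighted by the inverse squared relative errors (sigma_i / x_i)."""
    weights = [(xi / si) ** 2 for xi, si in zip(x, sigma)]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)

def logarithmic_average(x, sigma):
    """Weighted average of log10(x), using sigma_log = sigma / (x ln 10)."""
    sigma_log = [si / (xi * math.log(10)) for xi, si in zip(x, sigma)]
    weights = [1.0 / s ** 2 for s in sigma_log]
    return sum(w * math.log10(xi) for w, xi in zip(weights, x)) / sum(weights)

# Hypothetical measurements with equal (10 %) relative errors, for illustration only.
x = [0.90, 1.00, 1.10, 0.95]
sigma = [0.1 * xi for xi in x]

x_re = relative_error_weighted_average(x, sigma)
x_log = 10 ** logarithmic_average(x, sigma)   # back to linear units for comparison
```

With equal relative errors all weights coincide, so `x_re` reduces to the linear average and `x_log` to the geometric mean; the two stay close as long as the scatter of the measurements is small compared with their mean.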

Summary  of  Key  Concepts  for  this  Chapter 

□ Linear average: The mean $\bar{x}$ of $N$ measurements.

□ Median: The 50 % quantile, or the number below and above which there are 50 % of the variable’s values.

□ Logarithmic average: In some cases (e.g., when errors are proportional to the measured values) it is meaningful to calculate the weighted average of the logarithm of the variable,

$$ \overline{\log x} = \frac{\sum \log x_i/\sigma^2_{\log x_i}}{\sum 1/\sigma^2_{\log x_i}}, $$

where $\sigma_{\log x_i} = \sigma_i/(x_i \ln 10)$.

□ Relative-error weighted average: An approximation of the logarithmic average that does not require logarithms,

$$ \bar{x}_{RE} = \frac{\sum x_i/(\sigma_i/x_i)^2}{\sum 1/(\sigma_i/x_i)^2}. $$

Problems

6.1 Calculate the linear average and the weighted mean of the quantity “Ratio” in Table 6.1.

6.2  Consider  the  25  measurements  of  “Ratio”  in  Table  6.1.  Assume  that  an 
additional  uncertainty  of  ±0.1  is  to  be  added  linearly  to  the  statistical  error  of 
each  measurement  reported  in  the  table.  Show  that  the  addition  of  this  source  of 
uncertainty  results  in  a  weighted  mean  of  0.95  ±  0.04. 

6.3 Given two measurements $x_1$ and $x_2$ with values in the neighborhood of 1.0, show that the logarithm of the average of the measurements is approximately equal to the average of the logarithms of the measurements.

6.4 Given two measurements $x_1$ and $x_2$ with values in the neighborhood of a positive number $A$, show that the logarithm of the average of the measurements is approximately equal to the average of the logarithms of the measurements.

6.5 For the data in Table 6.1, calculate the linear average, weighted average and median of each quantity (Radius, Energy Method 1, Energy Method 2 and Ratio). You may assume that the error of each measurement is the average of the asymmetric errors reported for that measurement in the table.

6.6 Table 6.1 contains the measurement of the thermal energy of certain sources using two independent methods labeled as method #1 and method #2. For each source, the measurement is made at a given radius, which varies from source to source. The error bars indicate the 68 %, or 1σ, confidence intervals; the fact that most are asymmetric indicates that the measurements do not follow exactly a Gaussian distribution. Calculate the weighted mean of the ratios between the two measurements and its standard deviation, assuming that the errors are Gaussian and equal to the average of the asymmetric errors, as is often done in this type of situation.


Chapter  7 

Hypothesis  Testing  and  Statistics 


Abstract Every quantity that is estimated from the data, such as the mean or the variance of a Gaussian variable, is subject to statistical fluctuations of the measurements. For this reason such quantities are referred to as statistics. If a different sample of measurements is collected, statistical fluctuations will certainly give rise to a different set of measurements, even if the experiments are performed under the same conditions. The use of different data samples to measure the same statistic results in the determination of the sampling distribution of the statistic, which describes the expected range of values for that quantity. In this chapter we derive the distribution of a few fundamental statistics that play a central role in data analysis, such as the $\chi^2$ statistic. The distribution of each statistic can be used for a variety of tests, including the acceptance or rejection of the fit to a model.


7.1  Statistics  and  Hypothesis  Testing 

In this book we have already studied several quantities that are estimated from the data, such as the sample mean and the sample variance. These quantities are subject to random statistical fluctuations that occur during the measurement and collection process, and they are often referred to as random variables or statistics. For example, a familiar statistic is the sample mean of a variable $X$. Under the hypothesis that the variable $X$ follows a Gaussian distribution of mean $\mu$ and variance $\sigma^2$, the sample mean of $N$ measurements is Gaussian-distributed with mean $\mu$ and variance equal to $\sigma^2/N$ (see Sect. 4.1.2). This means that different samples of size $N$ will in general give rise to different sample means and that one expects a variance of order $\sigma^2/N$ among the various samples. This knowledge lets us establish whether a given sample mean is consistent with this theoretical expectation.

Hypothesis  testing  is  the  process  that  establishes  whether  the  measurement 
of  a  given  statistic,  such  as  the  sample  mean,  is  consistent  with  its  theoretical 
distribution.  Before  describing  this  process  in  detail,  we  illustrate  with  the  following 
example  the  type  of  statistical  statement  that  can  be  made  from  a  given  measurement 
and  the  knowledge  of  its  parent  distribution. 

Example 7.1 Consider the case of the measurement of the ratio $m/e$ from Tube 1 of Thomson’s experiment, and arbitrarily assume (this assumption will be relaxed in more realistic applications) that the parent mean is known to be equal to $\mu = 0.475$, and that the parent standard deviation is $\sigma = 0.075$. We want to make quantitative statements regarding the possibility that the measurements are drawn from the parent distribution.

For example, we can make the following statements concerning the measurement $m/e = 0.42$: since $m/e = \mu - 0.73\sigma$, there is a probability of approximately 23 % that a measurement of 0.42 or lower is recorded. This statement addresses the fact that, although the measurement fell short of the parent mean, there is still a significant (approximately 23 %) chance that any given measurement will be that low, or even lower. We can also make this statement: the measurement is within the 1σ central confidence interval, which encompasses 68 % of the probability. This statement looks at the distance of the measurement from the mean, regardless of its sign.

Before we can say: the measurement is consistent with the parent distribution, we need to quantify the meaning of the word consistent. ❖
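The two statements of Example 7.1 can be checked with a few lines of code. This is a minimal sketch that uses `statistics.NormalDist` from the Python standard library and treats 0.075 as the parent standard deviation, which is what the deviation of 0.73σ quoted in the example implies; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 0.475, 0.075        # assumed parent mean and standard deviation
measurement = 0.42

z = (measurement - mu) / sigma  # deviation in units of sigma, about -0.73
standard = NormalDist()         # standard Gaussian, zero mean and unit variance

# One-sided statement: probability of recording a value this low, or lower.
p_lower = standard.cdf(z)

# Two-sided statement: is the measurement inside the central 1-sigma interval?
p_central_1sigma = standard.cdf(1.0) - standard.cdf(-1.0)   # about 0.68
within_1sigma = abs(z) <= 1.0
```

The first quantity depends on the sign of the deviation, while the second only depends on its absolute value, which mirrors the distinction made in the example.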

The process of hypothesis testing requires a considerable amount of care in the definition of the hypothesis to test and in drawing conclusions. The method can be divided into the following four steps.

1. Begin with the definition of a hypothesis to test. For the measurements of a variable $X$, a possible hypothesis is that the measurements are consistent with a parent mean of $\mu = 0$ and a variance of $\sigma^2 = 1$. For a fit of a dataset to a linear model (Chap. 8) we may want to test whether the linear model is a constant, i.e., whether the parent value of the slope coefficient is $b = 0$. This initial step in the process identifies a so-called null hypothesis that we want to test with the available data.

2. The next step is to determine the statistic to use for the null hypothesis. In the example of the measurements of a variable $X$, the statistic we can calculate from the data is the sample mean. For the fit to the linear model, we will learn that $\chi^2_{min}$ is the statistic to use for a Gaussian dataset. The choice of statistic means that we are in a position to use the theoretical distribution function for that statistic to tell whether the actual measurements are consistent with its expected distribution, according to the null hypothesis.

3. Next we need to determine a probability or confidence level for the agreement between the statistic and its expected distribution under the null hypothesis. This level of confidence $p$, say $p = 0.9$ or 90 %, defines a range of values for the statistic that are consistent with its expected distribution. We will refer to this range as the acceptable region for the statistic. For example, a standard Gaussian of zero mean and unit variance has 90 % of its values in the range from −1.65 to +1.65. For a confidence level of $p = 0.9$, the analyst would require that the measurement fall within this range. The choice of probability $p$ is somewhat arbitrary: some analysts may choose 90 %, some may require 99.99 %, some may even be satisfied with 68 %, which is the probability associated with ±1σ for a Gaussian distribution. Values of the statistic outside of the acceptable range define the rejection region. For the standard Gaussian, the rejection region at $p = 0.9$ consists of values ≥ 1.65 and values ≤ −1.65, i.e., the rejection region is two-sided, as obtained from

$$ P\{|S| \geq \bar{S}\} = 1 - \int_{-\bar{S}}^{\bar{S}} f(s)\,ds = 1 - p \tag{7.1} $$

where $f(s)$ is the probability distribution of the statistic (in this example the standard Gaussian) and $\bar{S}$ is the critical value of the statistic at the level of confidence $p$. For two-sided rejection regions such as this, where large values of the absolute value of the statistic $S$ are not acceptable, the null hypothesis can be summarized as

$$ H_0 = \{\text{The statistic has values } |S| \leq \bar{S}\}. $$

Here we have assumed that the acceptable region is centered at 0, but other choices are also possible.

In other cases of interest, such as for the $\chi^2$ distribution, the rejection region is one-sided. The critical value at confidence level $p$ for the statistic can be found from

$$ P\{S \geq \bar{S}\} = \int_{\bar{S}}^{\infty} f(s)\,ds = 1 - p \tag{7.2} $$

where $f(s)$ is the probability distribution function of the statistic $S$. For one-sided rejection regions where large values of the statistic are not acceptable, the null hypothesis can now be summarized as

$$ H_0 = \{\text{The statistic has values } S \leq \bar{S}\}. $$

Clearly $p$ and $\bar{S}$ are related: the larger the value of the probability $p$, the larger the value of $\bar{S}$, according to (7.1) and (7.2). Larger values of $p$, such as $p = 0.9999$, increase the size of the acceptable region and reduce the size of the rejection region.

In principle, other choices for the acceptable and rejection regions are possible, such as multiple intervals or intervals that are not centered at zero. The corresponding critical value(s) of the statistic can be calculated using expressions similar to the two reported above. In the majority of cases, however, the rejection region is either a one-sided interval extending to infinity or a two-sided region centered at zero.

4.  Finally  we  are  in  a  position  to  make  a  quantitative  and  definitive  statement 
regarding  the  null  hypothesis.  Since  we  have  partitioned  the  range  of  the  statistic 
into  an  acceptable  region  and  a  rejection  region,  only  two  cases  are  possible: 

• Case 1: The measured value of the statistic $S$ falls into the rejection region. This means that the distribution function of the statistic of interest, under the null hypothesis, does not allow the measured value at the confidence level $p$. In this case the null hypothesis must be rejected at the stated confidence level $p$. The rejection of the null hypothesis means that the data should be tested for alternative hypotheses and the procedure can be repeated.

• Case 2: The measured value of the statistic $S$ is within the acceptable region. This means that there is a reasonable probability that the measured value of the statistic is consistent with the null hypothesis. In that case the null hypothesis cannot be rejected, i.e., the null hypothesis could be true. In this case one can state that the null hypothesis or the underlying model is consistent with the data. Sometimes this situation can be referred to as the null hypothesis being acceptable. This is, however, not the same as stating that the null hypothesis is the correct hypothesis and that the null hypothesis is accepted. In fact, there could be other hypotheses that could be acceptable and one cannot be certain that the null hypothesis tested represents the parent model for the data.
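The four steps above can be sketched in code for the simplest case of a statistic that follows a standard Gaussian under the null hypothesis. This is an illustrative sketch only: the function names are hypothetical, and the critical values are obtained from the inverse cumulative distribution provided by `statistics.NormalDist` in the Python standard library rather than from printed tables.

```python
from statistics import NormalDist

def critical_value_two_sided(p, dist=NormalDist()):
    """Critical value S_bar with P(|S| <= S_bar) = p, for a symmetric
    distribution centered at zero (standard Gaussian by default)."""
    return dist.inv_cdf(0.5 * (1.0 + p))

def critical_value_one_sided(p, dist=NormalDist()):
    """Critical value S_bar with P(S <= S_bar) = p (rejection of large values only)."""
    return dist.inv_cdf(p)

def decide_two_sided(s_measured, p):
    """Step 4: compare the measured statistic with the two-sided acceptable region."""
    s_bar = critical_value_two_sided(p)
    return "rejected" if abs(s_measured) > s_bar else "not rejected"

s_bar_90 = critical_value_two_sided(0.90)   # close to the value 1.65 quoted in step 3
```

Note that the function returns "not rejected" rather than "accepted", in keeping with the distinction made in Case 2.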

Example 7.2 Consider $N = 5$ independent measurements of a random variable $X$, namely $x_i = (10, 12, 15, 11, 13)$. We would like to test the hypothesis that the measurements are drawn from a Gaussian random variable with $\mu = 13$ and $\sigma^2 = 2$.

Next we need to determine the test statistic that we want to use. Since there are $N$ independent measurements of the same variable, we can consider the sum of all measurements as the statistic of interest,

$$ Y = \sum_{i=1}^{N} X_i, $$

which is distributed like a Gaussian $N(N \cdot \mu, N \cdot \sigma^2) = N(65, 10)$. We could have chosen the average of the measurements instead. It can be proven that the results of the hypothesis testing are equivalent for the two statistics.

The next step requires the choice of a confidence level for our hypothesis. Assume that we are comfortable with a value of $p = 95$ %. This means that the rejection region includes values that are more than 1.96σ (or 6.2 units) away from the parent mean of $\mu = 65$, on either side, as shown by the cross-hatched area in Fig. 7.1.

Next, we calculate the value of the statistic as $Y = 61$, and realize that the measured value does not fall within the region of rejection. We conclude that the data are consistent with the hypothesis that the measurements are drawn from the parent Gaussian at the 95 % probability level (or 1.96σ level).

Assume next that another analyst is satisfied with a $p = 68$ % probability, instead of 95 %. This means that the region of rejection begins $\pm 1.0\sigma = 1.0 \cdot \sqrt{10} = 3.2$ away from the mean. In this case, the rejection region becomes the hatched area in Fig. 7.1, and the measured value of the test statistic $Y$ falls in the rejection region. In this case, we conclude that the hypothesis must be rejected at the 68 % probability level (or at the 1σ level). ❖
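The numbers of Example 7.2 can be reproduced with the short sketch below. It is an illustration rather than a prescribed implementation: the helper name `is_rejected` is hypothetical, and the half-width of the acceptable region is computed from the inverse Gaussian cumulative distribution instead of being read from a table.

```python
import math
from statistics import NormalDist

x = [10, 12, 15, 11, 13]
mu, var = 13.0, 2.0               # parent mean and variance under the null hypothesis
n = len(x)

y = sum(x)                        # test statistic: sum of the measurements
y_mean = n * mu                   # parent mean of the sum
y_sigma = math.sqrt(n * var)      # parent standard deviation of the sum, sqrt(10)

def is_rejected(p):
    """Two-sided test of Y against N(y_mean, y_sigma^2) at confidence level p."""
    half_width = NormalDist().inv_cdf(0.5 * (1.0 + p)) * y_sigma
    return abs(y - y_mean) > half_width

rejected_95 = is_rejected(0.95)   # acceptable region about +/- 6.2 around 65
rejected_68 = is_rejected(0.68)   # acceptable region about +/- 3.1 around 65
```

The same data lead to opposite conclusions at the two confidence levels, which is the point made by the example.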

Fig. 7.1 Rejection regions at $p = 0.95$ and $p = 0.68$ confidence level for the test of the Gaussian origin of measurements $x_i = (10, 12, 15, 11, 13)$. The null hypothesis is that the sum of the measurements is drawn from a random variable $Y \sim N(\mu = 65, \sigma^2 = 10)$

The example above illustrates the importance of the choice of the confidence level $p$: the same null hypothesis can be acceptable or must be rejected depending on its value. To avoid this ambiguity, some analysts prefer to take a post-facto approach to the choice of $p$. In this example, the measured value of the sample mean corresponds to an absolute value of the deviation of 1.26σ from the parent mean. Such a deviation corresponds to a probability of approximately 79 % for the central interval of ±1.26σ around the parent mean. It is therefore possible to report this result with the statement that the data are consistent with the parent model at the 79 % confidence level. In general, for a two-sided rejection region, the measurement $S_{data}$ corresponds to a level of confidence $p$ via


$$ P\{|S| \geq |S_{data}|\} = 1 - \int_{-|S_{data}|}^{|S_{data}|} f(s)\,ds = 1 - p, \tag{7.3} $$

where $f(s)$ is the probability distribution of the test statistic under the null hypothesis (an equivalent expression applies to a one-sided rejection region). This equation can be used to make the statement that the measurement of $S_{data}$ is consistent with the model at the $p$ confidence level.
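The post-facto confidence level can be evaluated directly from the cumulative distribution of the statistic. The sketch below assumes a standard Gaussian statistic and a two-sided region centered at zero; the function name is illustrative. It is applied to the deviation of the sum of the five measurements of Example 7.2 from its parent mean.

```python
import math
from statistics import NormalDist

def two_sided_confidence_level(s_data, dist=NormalDist()):
    """Level p for which |s_data| sits on the boundary of the acceptable region:
    the integral of f(s) between -|s_data| and +|s_data|."""
    a = abs(s_data)
    return dist.cdf(a) - dist.cdf(-a)

# Deviation of the sum Y = 61 from its parent mean 65, in units of sqrt(10).
z = (61 - 65) / math.sqrt(10)
p = two_sided_confidence_level(z)   # roughly 0.79
```

Using the sample mean 12.2 with parent mean 13 and variance 2/5 gives the same standardized deviation, and therefore the same confidence level.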

It is necessary to discuss further the meaning of the word “acceptable” with regard to the null hypothesis. The fact that the measurements were within 1σ of a given mean does not imply that the parent distribution of the null hypothesis is the correct one; in fact, there could be other parent distributions that are equally well “acceptable.” Therefore, any null hypothesis can only be conclusively disproved (if the measurements were beyond, say, 3σ or 5σ of the parent mean, depending on the choice of probability $p$), but never conclusively proven to be the correct one, since this would imply exhausting and discarding all possible alternative hypotheses. The process of hypothesis testing is therefore slanted towards trying to disprove the null hypothesis, possibly in favor of alternative hypotheses. The rejection of the null hypothesis is the only outcome of the hypothesis testing process that is conclusive, in that it requires discarding the hypothesis.


7.2 The $\chi^2$ Distribution

Consider $N$ random variables $X_i$, each distributed like a Gaussian with mean $\mu_i$, variance $\sigma_i^2$, and independent of one another. For each variable $X_i$, the associated z-score

$$ Z_i = \frac{X_i - \mu_i}{\sigma_i} $$

is a standard Gaussian of zero mean and unit variance. We are interested in finding the distribution function of the random variable given by the sum of the squares of all the deviations,

$$ Z = \sum_{i=1}^{N} Z_i^2. \tag{7.4} $$

This quantity will be called a $\chi^2$-distributed variable.

The reason for our interest in this distribution will become apparent from the use of the maximum likelihood method in fitting two-dimensional data (Chap. 8). In fact, the sum of the squares of the deviations of the measurements from their mean,

$$ \chi^2 = \sum_{i=1}^{N} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2, $$

represents a measure of how well the measurements follow the expected values $\mu_i$.


7.2.1  The  Probability  Distribution  Function 

The theoretical distribution of $Z$ is obtained by making use of the Gaussian distribution for its components. To derive the distribution function of $Z$, we first prove that the moment generating function of the square of each Gaussian $Z_i$ is given by

$$ M_{Z_i^2}(t) = \sqrt{\frac{1}{1 - 2t}}. \tag{7.5} $$

This result enables the comparison with the moment generating function of another distribution and the determination of the distribution function of $Z$.

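Proof The moment generating function of the square of a standard Gaussian $Z_i$ is given by

$$ M_{Z_i^2}(t) = E[e^{Z_i^2 t}] = \int_{-\infty}^{+\infty} e^{x^2 t} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-x^2 (1/2 - t)}\,dx. $$

We use the fact that $\int_{-\infty}^{+\infty} e^{-y^2} dy = \sqrt{\pi}$; thus, change variable $y^2 = x^2 (1/2 - t)$, and use $2x\,dx\,(1/2 - t) = 2y\,dy$:

$$ dx = \frac{y\,dy}{x (1/2 - t)} = \frac{\sqrt{1/2 - t}}{1/2 - t}\,dy = \frac{dy}{\sqrt{1/2 - t}}. $$

This results in the following moment generating function for $Z_i^2$:

$$ M_{Z_i^2}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-y^2} \frac{dy}{\sqrt{1/2 - t}} = \sqrt{\frac{1}{1 - 2t}}. \tag{7.6} $$

□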
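The result of the proof can be verified numerically by integrating $e^{t x^2}$ against the standard Gaussian density and comparing with $1/\sqrt{1 - 2t}$. The sketch below is only a sanity check under stated assumptions: it uses a plain trapezoidal rule, and the integration range and number of steps are arbitrary choices that are adequate for values of $t$ not too close to 1/2, where the integral stops converging.

```python
import math

def mgf_z_squared_numeric(t, half_width=12.0, steps=20_000):
    """Trapezoidal estimate of E[exp(t Z^2)] for a standard Gaussian Z (t < 1/2)."""
    if t >= 0.5:
        raise ValueError("the integral converges only for t < 1/2")
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * h
        f = math.exp(-x * x * (0.5 - t)) / math.sqrt(2.0 * math.pi)
        total += f if 0 < k < steps else 0.5 * f   # end points carry half weight
    return total * h

def mgf_z_squared_exact(t):
    """Closed form obtained in the proof above."""
    return 1.0 / math.sqrt(1.0 - 2.0 * t)

difference = mgf_z_squared_numeric(0.25) - mgf_z_squared_exact(0.25)
```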

We make use of the property that $M_{x+y}(t) = M_x(t) \cdot M_y(t)$ for independent variables (4.10). Since the variables $X_i$ are independent of one another, so are the variables $Z_i^2$. Therefore, the moment generating function of $Z$ is given by

$$ M_Z(t) = \prod_{i=1}^{N} M_{Z_i^2}(t) = \left( \frac{1}{1 - 2t} \right)^{N/2}. $$

To connect this result with the distribution function for $Z$, we need to introduce the gamma distribution:

$$ f_\gamma(r, \alpha) = \frac{\alpha (\alpha x)^{r-1} e^{-\alpha x}}{\Gamma(r)} \tag{7.7} $$

where $\alpha$, $r$ are positive numbers, and $x \geq 0$. Its name derives from the following relationship with the Gamma function:

$$ \Gamma(r) = \int_0^\infty e^{-x} x^{r-1}\,dx. \tag{7.8} $$

For integer arguments, $\Gamma(n) = (n - 1)!$. It can be shown that the mean of the gamma distribution is $\mu = r/\alpha$, and the variance is $\sigma^2 = r/\alpha^2$. From property (7.8), it is also clear that the gamma distribution in (7.7) is properly normalized.

Next, we show that the moment generating function of a gamma distribution is a generalization of the moment generating function of the square of a standard normal distribution,

$$ M_\gamma(t) = \frac{1}{(1 - t/\alpha)^r}. \tag{7.9} $$


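Proof The moment generating function of a gamma distribution is calculated as

$$ M_\gamma(t) = \int_0^\infty e^{tz} \frac{\alpha (\alpha z)^{r-1} e^{-\alpha z}}{\Gamma(r)}\,dz = \int_0^\infty \frac{\alpha^r}{\Gamma(r)} z^{r-1} e^{-z(\alpha - t)}\,dz = \frac{\alpha^r}{\Gamma(r)\,(\alpha - t)^r} \int_0^\infty (\alpha - t)(\alpha - t)^{r-1} z^{r-1} e^{-z(\alpha - t)}\,dz. $$

The change of variable $x = z(\alpha - t)$, $dx = dz\,(\alpha - t)$ enables us to use the normalization property of the gamma distribution,

$$ M_\gamma(t) = \frac{\alpha^r}{(\alpha - t)^r} \int_0^\infty \frac{x^{r-1} e^{-x}}{\Gamma(r)}\,dx = \frac{\alpha^r}{(\alpha - t)^r} = \frac{1}{(1 - t/\alpha)^r}. \tag{7.10} $$

□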

The results shown in (7.5) and (7.9) prove that the moment generating functions for the $Z$ and gamma distributions are related to one another. This relationship can be used to conclude that the random variable $Z$ is distributed like a gamma distribution with parameters $r = N/2$ and $\alpha = 1/2$. The random variable $Z$ is usually referred to as a $\chi^2$ variable with $N$ degrees of freedom, and has a probability distribution function

$$ f_{\chi^2}(z, N) = f_Z(z) = \left( \frac{1}{2} \right)^{N/2} \frac{e^{-z/2} z^{N/2 - 1}}{\Gamma(N/2)}. \tag{7.11} $$


An example of $\chi^2$ distribution is shown in Fig. 7.2. The distribution is unimodal, although not symmetric with respect to the mean.

Fig. 7.2 The $Z$ statistic is a $\chi^2$ distribution with 5 degrees of freedom. The hatched area is the 68 % rejection region, and the cross-hatched area the 95 % region
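The probability distribution function of a $\chi^2$ variable, i.e., a gamma distribution with $r = N/2$ and $\alpha = 1/2$, is simple to code with `math.gamma`. The sketch below is an illustration with hypothetical function names; it checks numerically, with a basic trapezoidal rule over an arbitrarily chosen range, that the density for 5 degrees of freedom integrates to one and has a mean equal to the number of degrees of freedom.

```python
import math

def chi2_pdf(z, n):
    """Density of a chi-squared variable with n degrees of freedom,
    i.e., a gamma distribution with r = n/2 and alpha = 1/2."""
    if z < 0:
        return 0.0
    if z == 0:
        # limit z -> 0: infinite for n = 1, equal to 1/2 for n = 2, zero for n > 2
        return math.inf if n < 2 else (0.5 if n == 2 else 0.0)
    r = n / 2
    return 0.5 ** r * z ** (r - 1) * math.exp(-z / 2) / math.gamma(r)

def integrate(f, a, b, steps=20_000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    inner = sum(f(a + k * h) for k in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

n = 5
norm = integrate(lambda z: chi2_pdf(z, n), 0.0, 100.0)       # expected close to 1
mean = integrate(lambda z: z * chi2_pdf(z, n), 0.0, 100.0)   # expected close to n
```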


7.2.2  Moments  and  Other  Properties 


Since the mean and variance of a gamma distribution with parameters $r$, $\alpha$ are $\mu = r/\alpha$ and $\sigma^2 = r/\alpha^2$, the $\chi^2$ distribution has the following moments:

$$ \begin{cases} \mu = N \\ \sigma^2 = 2N. \end{cases} \tag{7.12} $$


This result shows that the expectation of a $\chi^2$ variable is equal to the number of degrees of freedom. It is common to use the reduced $\chi^2$ variable defined by

$$ \chi^2_{red} = \frac{\chi^2}{N}. \tag{7.13} $$

The mean or expectation of the reduced $\chi^2$ and the variance are therefore given by

$$ \begin{cases} \mu = 1 \\ \sigma^2 = \dfrac{2}{N} \end{cases} \quad (\text{reduced } \chi^2) \tag{7.14} $$

As a result, the ratio between the standard deviation and the mean for the reduced $\chi^2$, a measure of the spread of the distribution, decreases with the number of degrees of freedom,

$$ \frac{\sigma}{\mu} = \sqrt{\frac{2}{N}}. $$

As the number of degrees of freedom increases, the values of the reduced $\chi^2$ are more closely distributed around 1.

As derived earlier, the moment generating function of the $\chi^2$ distribution is

$$ M_{\chi^2}(t) = E[e^{tZ}] = \frac{1}{(1 - 2t)^{N/2}}. \tag{7.15} $$

This form of the moment generating function highlights the property that, if two independent $\chi^2$ distributions have, respectively, $N$ and $M$ degrees of freedom, then the sum of the two variables will also be a $\chi^2$ variable, and it will have $N + M$ degrees of freedom. In fact, the generating function of the sum of independent variables is the product of the two functions, and the exponents in (7.15) will add.
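The moments of the $\chi^2$ distribution and the additivity of the degrees of freedom can be illustrated with a small Monte Carlo simulation. This is a sketch under stated assumptions: the numbers of degrees of freedom (4 and 6), the number of trials and the seed are arbitrary choices, and the sample moments only approximate the parent values $N$ and $2N$ within the statistical fluctuations of the simulation.

```python
import random
import statistics

def sample_chi2(n_dof, rng):
    """One realization of Z: the sum of n_dof squared standard Gaussians."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n_dof))

rng = random.Random(12345)          # fixed seed, so that the run is reproducible
n_dof, m_dof, trials = 4, 6, 20_000

z_n = [sample_chi2(n_dof, rng) for _ in range(trials)]
z_m = [sample_chi2(m_dof, rng) for _ in range(trials)]
z_sum = [a + b for a, b in zip(z_n, z_m)]   # expected to behave like N + M = 10 dof

mean_n, var_n = statistics.fmean(z_n), statistics.variance(z_n)
mean_sum, var_sum = statistics.fmean(z_sum), statistics.variance(z_sum)

# Spread of the distribution, to be compared with sqrt(2 / (N + M)).
relative_spread = statistics.stdev(z_sum) / mean_sum
```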


7.2.3  Hypothesis  Testing 

The null hypothesis for a $\chi^2$ distribution is that all measurements are consistent with the parent Gaussians. Under this hypothesis, we have derived the probability distribution function $f_{\chi^2}(z, N)$, where $N$ is the number of degrees of freedom of the distribution. If the $N$ measurements are consistent with their parent distributions, one expects a value of approximately $\chi^2 \simeq N$, i.e., each of the $N$ measurements contributes approximately a value of one to the $\chi^2$. Large values of $\chi^2$ clearly indicate that some of the measurements are not consistent with the parent Gaussian, i.e., some of the measurements $x_i$ differ by several standard deviations from the expected mean, either in defect or in excess. Likewise, values of $\chi^2 \ll N$ are also not expected. Consider, for example, the extreme case of $N$ measurements all identical to the parent mean, resulting in $\chi^2 = 0$. Statistical fluctuations of the random variables make it extremely unlikely that all $N$ measurements match the mean. Clearly such an extreme case of perfect agreement between the data and the parent model is suspicious and the data should be checked for possible errors in the collection or analysis.

Although a very small value of $\chi^2$ is unlikely, it is customary to test for the agreement between a measurement of $\chi^2$ and its theoretical distribution using a one-sided rejection region consisting of values of $\chi^2$ exceeding a critical value. This means that the acceptable region is for values of $\chi^2$ that are between zero and the critical value. Critical values of the $\chi^2$ distribution for a confidence level $p$ can be calculated via

$$ P(z \geq \chi^2_{crit}) = \int_{\chi^2_{crit}}^{\infty} f_{\chi^2}(z, N)\,dz = 1 - p \tag{7.16} $$

and are tabulated in Table A.7.

Example 7.3 Assume the $N = 5$ measurements of a variable $X$, (10, 12, 15, 11, 13), presented in Example 7.2. We want to test the hypothesis that these were independent measurements of a Gaussian variable $X$ of mean $\mu = 13$ and variance $\sigma^2 = 2$. Under this assumption, we could use the $\chi^2$ statistic to try and falsify the null hypothesis that the data are drawn from the given Gaussian. The procedure for a quantitative answer to this hypothesis is that of deciding a level of probability $p$, then to calculate the value of the statistic,

$$ \chi^2 = \frac{1}{2} \left( (10 - 13)^2 + (12 - 13)^2 + (15 - 13)^2 + (11 - 13)^2 + (13 - 13)^2 \right) = 9. $$

In Fig. 7.2 we show the rejection regions for a probability $p = 0.95$ and $p = 0.68$, which are determined according to the tabulation of the integral of the $\chi^2$ distribution with $N = 5$ degrees of freedom: $\chi^2_{crit} = 6.1$ marks the beginning of the 70 % rejection region, and $\chi^2_{crit} = 11.1$ that of the 95 % rejection region. The hypothesis is therefore rejected at the 68 % probability level, but cannot be rejected at the 95 % confidence level.

Moreover, we calculate from Table A.7

$$ P(\chi^2 \geq 9) = \int_9^\infty f_{\chi^2}(z, 5)\,dz \simeq 0.10. $$

We therefore conclude that there is a 10 % probability of observing such a value of $\chi^2$, or higher, under the hypothesis that the measurements were made from a Gaussian distribution of such mean and variance (see Fig. 7.2). Notice that the results obtained using the $\chi^2$ distribution are similar to those obtained with the test that made use of the sum of the five measurements. ❖
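The numbers of Example 7.3 can be reproduced without tables. The sketch below is one possible approach, not the method used for Table A.7: the upper-tail probability is built from the recurrence $Q(a+1, y) = Q(a, y) + y^a e^{-y}/\Gamma(a+1)$ for the regularized upper incomplete gamma function, with $y = x/2$, and the critical values are found by bisection. The function names are illustrative.

```python
import math

def chi2_sf(x, n):
    """Upper-tail probability P(chi2 >= x) for n degrees of freedom."""
    y = x / 2.0
    if n % 2 == 0:
        a, q = 1.0, math.exp(-y)                # n = 2: Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))     # n = 1: Q(1/2, y) = erfc(sqrt(y))
    while 2 * a < n:                            # step a -> a + 1 until a = n/2
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi2_critical(p, n, hi=1000.0):
    """Critical value with P(chi2 >= value) = 1 - p, found by bisection."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, n) > 1.0 - p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = [10, 12, 15, 11, 13]
mu, var = 13.0, 2.0
chi2 = sum((xi - mu) ** 2 for xi in x) / var    # value of the statistic
p_tail = chi2_sf(chi2, len(x))                  # probability to exceed the measured value
crit_95 = chi2_critical(0.95, len(x))
crit_70 = chi2_critical(0.70, len(x))
```

The measured value lies between the two critical values, so the hypothesis is rejected at the lower confidence level but not at the 95 % level.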


7.3  The  Sampling  Distribution  of  the  Variance 

The distribution function of the sample variance, or sampling distribution of the variance, is useful to compare a given measurement of the sample variance $s^2$ with the parent variance $\sigma^2$. We consider $N$ measurements of $X$ that are distributed like a Gaussian of mean $\mu$, variance $\sigma^2$ and independent of each other. The variable $S^2$ defined by

$$ S^2 = (N - 1) s^2 = \sum_{i=1}^{N} (X_i - \bar{X})^2 \tag{7.17} $$

is proportional to the sample variance $s^2$. We seek a distribution function for $S^2/\sigma^2$ that enables a comparison of the measured sample variance with the parent variance $\sigma^2$.

In determining the sampling distribution of the variance we do not want to assume that the mean of the parent Gaussian is known, as we did in the previous section for the determination of the $\chi^2$ distribution. This is important, since in a typical experiment we do not know a priori the parent mean of the distribution, but we can easily calculate the sample mean. One complication in the use of (7.17) is therefore that $\bar{X}$ is itself a random variable, and not an exactly known quantity. This fact must be taken into account when calculating the expectation of $S^2$. A measurement of $S^2$ is equal to

$$ S^2 = \sum_{i=1}^{N} (x_i - \mu + \mu - \bar{x})^2 = \sum_{i=1}^{N} (x_i - \mu)^2 - N (\mu - \bar{x})^2. \tag{7.18} $$

Dividing both terms by $\sigma^2$, we obtain the following result:

$$ \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{\sigma^2} = \frac{S^2}{\sigma^2} + \frac{(\bar{x} - \mu)^2}{\sigma^2/N}. \tag{7.19} $$


According to the result in Sect. 7.2, the left-hand side term is distributed like a $\chi^2$ variable with $N$ degrees of freedom, since the parent mean $\mu$ and variance $\sigma^2$ appear in the sum of squares. For the same reason, the second term in the right-hand side is also distributed like a $\chi^2$ variable with 1 degree of freedom, since we have already determined that the sample mean $\bar{X}$ is distributed like a Gaussian with mean $\mu$ and with variance $\sigma^2/N$. Although it may not be apparent at first sight, it can be proven that the two terms on the right-hand side are two independent random variables. If we can establish the independence between these two variables, then it must be true that the first variable in the right-hand side, $S^2/\sigma^2$, is also distributed like a $\chi^2$ distribution with $N - 1$ degrees of freedom. This follows from the fact that the sum of two independent $\chi^2$ variables is also a $\chi^2$ variable featuring the sum of the degrees of freedom of the two variables, as shown in Sect. 7.2.
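Before going through the proof, the claim that $S^2/\sigma^2$ follows a $\chi^2$ distribution with $N - 1$ degrees of freedom can be checked with a simulation. The sketch below is only a plausibility check under stated assumptions: the parent Gaussian, sample size, number of trials and seed are arbitrary, and only the first two moments, expected near $N - 1$ and $2(N - 1)$, are compared.

```python
import random
import statistics

rng = random.Random(2024)             # fixed seed for reproducibility
mu, sigma = 13.0, 2.0 ** 0.5          # hypothetical parent Gaussian
n, trials = 5, 20_000

ratios = []
for _ in range(trials):
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(x)                     # sample mean, itself a random variable
    s2_sum = sum((xi - xbar) ** 2 for xi in x)     # S^2 = (N - 1) s^2
    ratios.append(s2_sum / sigma ** 2)

mean_ratio = statistics.fmean(ratios)      # expected near N - 1 = 4
var_ratio = statistics.variance(ratios)    # expected near 2 (N - 1) = 8
```

Replacing the sample mean `xbar` with the parent mean `mu` in the sum of squares would instead give moments near $N$ and $2N$, which is the loss of one degree of freedom discussed above.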

Proof The proof of the independence between $S^2/\sigma^2$ and $(\bar{x} - \mu)^2/(\sigma^2/N)$, and the fact that both are distributed like $\chi^2$ distributions with, respectively, $N - 1$ and 1 degrees of freedom, can be obtained by making a suitable change of variables from the original $N$ standard normal variables that appear in the left-hand side of (7.19),

$$ Z_i = \frac{x_i - \mu}{\sigma}, $$

to a new set of $N$ variables $Y_i$. The desired transformation is one that has the property

$$ Z_1^2 + \ldots + Z_N^2 = Y_1^2 + \ldots + Y_N^2. $$

This is called an orthonormal (linear) transformation, and in matrix form it can be expressed by a transformation matrix $A$, of dimensions $N \times N$, such that a row vector $z = (Z_1, \ldots, Z_N)$ is transformed into another vector $y$ by way of the product $y = zA$. For such a transformation, the dot product of the vector $y$ with itself is expressed as $y y^T = z A A^T z^T$. Since for an orthonormal transformation the relationship $A A^T = I$ holds, where $I$ is the $N \times N$ identity matrix, the dot product remains constant upon this transformation. An orthonormal transformation, expressed in extended form as

$$\begin{cases} Y_1 = a_1 Z_1 + \ldots + a_N Z_N \\ Y_2 = b_1 Z_1 + \ldots + b_N Z_N \\ \ldots \end{cases}$$

is obtained when, for each row vector, $\sum a_i^2 = 1$ and, for any pair of row vectors, $\sum a_i b_i = 0$, so that the $Y_i$'s are independent of one another.

Any such orthonormal transformation, when applied to $N$ independent variables that are standard Gaussians, $Z_i \sim N(0, 1)$, as is the case in this application, is such that the transformed variables $Y_i$ are also independent standard Gaussians. In fact, the joint probability distribution function of the $Z_i$'s can be written as

$$f(z) = \frac{1}{(2\pi)^{N/2}} e^{-\frac{z_1^2 + \ldots + z_N^2}{2}};$$

and, since the transformed variables have the same dot product, $z_1^2 + \ldots + z_N^2 = y_1^2 + \ldots + y_N^2$, the $N$ variables $Y_i$ have the same joint distribution function, proving that they are also independent standard Gaussians.

We want to use these general properties of orthonormal transformations to find a transformation that will enable a proof of the independence between $S^2/\sigma^2$ and $(\bar{x} - \mu)^2/(\sigma^2/N)$. The first variable is defined by the following linear combination,

$$Y_1 = \frac{Z_1}{\sqrt{N}} + \ldots + \frac{Z_N}{\sqrt{N}},$$

in such a way that the following relationships hold:

$$\begin{cases} Y_1^2 = \dfrac{(\bar{x} - \mu)^2}{\sigma^2/N} \\ \sum_{i=1}^{N} Z_i^2 = \dfrac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2, \text{ or} \\ \sum_{i=1}^{N} Z_i^2 = Y_1^2 + \sum_{i=2}^{N} Y_i^2. \end{cases}$$

The other $N - 1$ variables $Y_2, \ldots, Y_N$ can be chosen arbitrarily, provided they satisfy the requirements of orthonormality. Since, from (7.19), $\sum_{i=1}^{N} Z_i^2 - Y_1^2 = S^2/\sigma^2$, we can conclude that

$$\frac{S^2}{\sigma^2} = \sum_{i=2}^{N} Y_i^2,$$

proving that $S^2/\sigma^2$ is distributed like a $\chi^2$ distribution with $N - 1$ degrees of freedom, as the sum of squares of $N - 1$ independent standard Gaussians, and that $S^2/\sigma^2$ is independent of the sampling distribution of the mean, $Y_1^2$, since the variables $Y_i$ are independent of each other. This proof is due to Bulmer [7], who used a derivation done earlier by Helmert [20]. □
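
As a numerical illustration of the change of variables used in the proof, the following Python sketch builds one possible orthonormal matrix with a first row of constant entries $1/\sqrt{N}$ (the classical Helmert construction) and applies it to the z-scores of five measurements. The function name `helmert_matrix`, the choice of data, and the parent values $\mu = 13$, $\sigma^2 = 2$ are assumptions made only for this example; any other orthonormal completion of the first row would serve equally well. Here each row of the matrix holds the coefficients $a_i$, $b_i$, etc. of one new variable.

```python
import math

def helmert_matrix(n):
    """Return an n x n orthonormal matrix whose first row is (1/sqrt(n), ..., 1/sqrt(n))."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(2, n + 1):
        norm = math.sqrt(k * (k - 1))
        # k-1 equal entries, one balancing negative entry, then zeros
        rows.append([1.0 / norm] * (k - 1) + [-(k - 1) / norm] + [0.0] * (n - k))
    return rows

x = [10, 12, 15, 11, 13]   # illustrative measurements
mu, sigma2 = 13.0, 2.0     # assumed parent mean and variance
n = len(x)

# standard normal variables Z_i = (x_i - mu) / sigma
z = [(xi - mu) / math.sqrt(sigma2) for xi in x]

# transformed variables Y_r = sum_i a[r][i] * Z_i
a = helmert_matrix(n)
y = [sum(a[r][i] * z[i] for i in range(n)) for r in range(n)]

xbar = sum(x) / n
s2_over_sigma2 = sum((xi - xbar) ** 2 for xi in x) / sigma2

sum_z2 = sum(zi ** 2 for zi in z)        # left-hand side of (7.19)
sum_y2 = sum(yi ** 2 for yi in y)        # preserved by the transformation
y1_squared = y[0] ** 2                   # equals (xbar - mu)^2 / (sigma^2 / N)
rest = sum(yi ** 2 for yi in y[1:])      # equals S^2 / sigma^2

print(sum_z2, sum_y2)
print(y1_squared, (xbar - mu) ** 2 / (sigma2 / n))
print(rest, s2_over_sigma2)
```

Each printed pair should agree to within rounding, which is the content of the three relationships used in the proof.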

We are therefore able to conclude that the ratio $S^2/\sigma^2$ is distributed like a $\chi^2$ variable with $N - 1$ degrees of freedom,

$$\frac{S^2}{\sigma^2} \sim \chi^2(N - 1). \qquad (7.20)$$


The difference between the $\chi^2$ distribution (7.11) and the distribution of the sample variance (7.20) is that in the latter case the mean of the parent distribution is not assumed to be known, but it is calculated from the data. This is in fact the more common situation, and therefore when $N$ measurements are obtained, the quantity

$$\frac{S^2}{\sigma^2} = \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^2$$

is distributed like a $\chi^2$ distribution with just $N - 1$ degrees of freedom, not $N$. This reduction in the number of degrees of freedom can be expressed by saying that one degree of freedom is being used to estimate the mean.
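
The loss of one degree of freedom can also be checked by simulation. The sketch below is an illustration rather than part of the derivation: it draws many samples of $N = 10$ Gaussian measurements, computes $S^2/\sigma^2$ with the sample mean for each, and compares the average and the variance of the results with the values $N - 1$ and $2(N - 1)$ expected for a $\chi^2(N - 1)$ variable. The parent parameters, the number of trials, and the random seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)                         # fixed seed so that the run is reproducible
N, mu, sigma = 10, 13.0, 2.0 ** 0.5    # illustrative parent parameters
trials = 20000

ratios = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    xbar = sum(sample) / N                          # mean estimated from the data
    S2 = sum((xi - xbar) ** 2 for xi in sample)     # sum of squared deviations
    ratios.append(S2 / sigma ** 2)

mc_mean = statistics.mean(ratios)
mc_var = statistics.variance(ratios)
print(mc_mean, "expected about", N - 1)
print(mc_var, "expected about", 2 * (N - 1))
```

Replacing `xbar` with the parent mean `mu` in the sum of squares should move the two numbers close to $N$ and $2N$ instead.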
Example 7.4 Assume $N = 10$ measurements of a given quantity (10, 12, 15, 11, 13, 16, 12, 10, 18, 13). We want to answer the following question: Are these measurements consistent with being drawn from the same Gaussian random variable with $\sigma^2 = 2$? If the measurements are in fact derived from the same variable, then the measured value of $S^2$ will be consistent with its theoretical distribution that was just derived in (7.20).

With a sample mean of $\bar{x} = 13$, the sum of the squared deviations is $S^2 = 62$. Therefore, the measurement $S^2/\sigma^2 = 62/2 = 31$ must be compared with the $\chi^2$ distribution with $N - 1 = 9$ degrees of freedom. The measurement is equivalent to a reduced $\chi^2$ value of 3.4, which is inconsistent with a $\chi^2$ distribution with 9 degrees of freedom at more than the 99 % confidence level. We therefore conclude that the hypothesis must be rejected with this confidence level.

It is necessary to point out that, in this calculation, we assumed that the parent variance was known. In the following section we will provide another test that can be used to compare two measurements of the variance and that does not require knowledge of the parent variance. That is in fact the more common experimental situation and it requires a detailed study. ◊
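
The arithmetic of this example can be reproduced with a few lines of Python. The sketch below is a hedged illustration: the helper names `chi2_pdf` and `simpson` are introduced only here, and the tail probability is obtained by a simple numerical integration of the $\chi^2$ density rather than from the tables of the Appendix.

```python
import math

def chi2_pdf(x, f):
    """Density of a chi-squared variable with f > 2 degrees of freedom."""
    if x <= 0:
        return 0.0
    return math.exp((f / 2 - 1) * math.log(x) - x / 2
                    - (f / 2) * math.log(2) - math.lgamma(f / 2))

def simpson(func, a, b, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3

data = [10, 12, 15, 11, 13, 16, 12, 10, 18, 13]
sigma2 = 2.0                                   # parent variance under test
N = len(data)
xbar = sum(data) / N
S2 = sum((xi - xbar) ** 2 for xi in data)      # sum of squared deviations
chi2_value = S2 / sigma2
dof = N - 1                                    # one degree of freedom used for the mean

# probability of exceeding the measured value under the null hypothesis
tail = 1.0 - simpson(lambda x: chi2_pdf(x, dof), 0.0, chi2_value)

print("mean:", xbar, "S2:", S2)
print("chi2:", chi2_value, "reduced chi2:", chi2_value / dof)
print("P(chi2 > measured):", tail)
```

A tail probability below 0.01 corresponds to rejection at the 99 % confidence level.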


7.4  The  F  Statistic 

The distribution of the sample variance discussed above in Sect. 7.3 shows that if the actual variance $\sigma^2$ is not known, then it is impossible to make a quantitative comparison of the sample variance with the parent distribution. Alternatively, one can compare two different measurements of the variance, and ask the associated question of whether the ratio between the two measurements is reasonable. In this case the parent variance $\sigma^2$ drops out of the equation and is not required to compare two measurements of the sample variance.

For this purpose, consider two independent random variables $Z_1$ and $Z_2$, respectively, distributed like a $\chi^2$ distribution with $f_1$ and $f_2$ degrees of freedom. We define the random variable F as

$$F = \frac{Z_1/f_1}{Z_2/f_2}. \qquad (7.21)$$

The variable F is equivalent to the ratio of two reduced $\chi^2$, and therefore is expected to have values close to unity.
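7.4.1 The Probability Distribution Function

We show that the probability distribution function of the random variable F is given by

$$f_F(z) = \frac{\Gamma\left(\frac{f_1 + f_2}{2}\right)}{\Gamma\left(\frac{f_1}{2}\right) \Gamma\left(\frac{f_2}{2}\right)} \left( \frac{f_1}{f_2} \right)^{f_1/2} \frac{z^{f_1/2 - 1}}{\left( 1 + z \frac{f_1}{f_2} \right)^{(f_1 + f_2)/2}}. \qquad (7.22)$$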


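Proof The proof makes use of the methods described in Sects. 4.4.1 and 4.4.2. First we derive the distribution functions of the numerator and denominator of (7.21), and then we calculate the distribution function for the ratio of two variables with known distribution.

Given that $Z_1 \sim \chi^2(f_1)$ and $Z_2 \sim \chi^2(f_2)$, the distribution functions of $X' = Z_1/f_1$ and $Y' = Z_2/f_2$ are found using change of variables; for $X'$,

$$f_{X'}(x') = f(z) \frac{dz}{dx'} = f(z) f_1,$$

where $f(z)$ is the distribution of $Z_1$. This results in

$$f_{X'}(x') = \frac{z^{f_1/2 - 1} e^{-z/2}}{\Gamma(f_1/2)\, 2^{f_1/2}} f_1 = \frac{(x' f_1)^{f_1/2 - 1} e^{-(x' f_1)/2}}{\Gamma(f_1/2)\, 2^{f_1/2}} f_1;$$

the same transformation applies to $Y'$. Now we can use (4.18),

$$\begin{aligned}
f_F(z) &= \int_0^\infty f_{X'}(z\xi)\, \xi\, f_{Y'}(\xi)\, d\xi \\
&= \int_0^\infty \frac{(z \xi f_1)^{f_1/2 - 1} e^{-(z \xi f_1)/2}}{\Gamma(f_1/2)\, 2^{f_1/2}} f_1\, \xi\, \frac{(\xi f_2)^{f_2/2 - 1} e^{-(\xi f_2)/2}}{\Gamma(f_2/2)\, 2^{f_2/2}} f_2\, d\xi \\
&= \frac{f_1^{f_1/2} f_2^{f_2/2} z^{f_1/2 - 1}}{\Gamma(f_1/2) \Gamma(f_2/2)\, 2^{(f_1 + f_2)/2}} \int_0^\infty \xi^{f_1/2 - 1 + f_2/2 - 1 + 1} e^{-\frac{1}{2} \xi (z f_1 + f_2)}\, d\xi \\
&= \frac{f_1^{f_1/2} f_2^{f_2/2} z^{f_1/2 - 1}}{\Gamma(f_1/2) \Gamma(f_2/2)\, 2^{(f_1 + f_2)/2}} \int_0^\infty \xi^{(f_1 + f_2)/2 - 1} e^{-\frac{1}{2} \xi (z f_1 + f_2)}\, d\xi.
\end{aligned}$$

After another change of variables, $t = \xi (z f_1 + f_2)/2$, $dt = d\xi\, (z f_1 + f_2)/2$, the integral becomes

$$\begin{aligned}
\int_0^\infty \xi^{(f_1 + f_2)/2 - 1} e^{-\frac{1}{2} \xi (z f_1 + f_2)}\, d\xi &= \frac{2^{(f_1 + f_2)/2}}{(z f_1 + f_2)^{1 + (f_1 + f_2)/2 - 1}} \int_0^\infty t^{(f_1 + f_2)/2 - 1} e^{-t}\, dt \\
&= \frac{2^{(f_1 + f_2)/2}}{(z f_1 + f_2)^{(f_1 + f_2)/2}} \Gamma\left( \frac{f_1 + f_2}{2} \right).
\end{aligned}$$

Therefore the distribution of F is given by

$$\begin{aligned}
f_F(z) &= \frac{f_1^{f_1/2} f_2^{f_2/2} z^{f_1/2 - 1}}{\Gamma(f_1/2) \Gamma(f_2/2)\, 2^{(f_1 + f_2)/2}} \cdot \frac{2^{(f_1 + f_2)/2}\, \Gamma\left( \frac{f_1 + f_2}{2} \right)}{(z f_1 + f_2)^{(f_1 + f_2)/2}} \\
&= \frac{\Gamma\left( \frac{f_1 + f_2}{2} \right)}{\Gamma(f_1/2) \Gamma(f_2/2)} \cdot \frac{f_1^{f_1/2} f_2^{f_2/2} z^{f_1/2 - 1}}{\left( 1 + z \frac{f_1}{f_2} \right)^{(f_1 + f_2)/2} f_2^{(f_1 + f_2)/2}} \\
&= \frac{\Gamma\left( \frac{f_1 + f_2}{2} \right)}{\Gamma(f_1/2) \Gamma(f_2/2)} \left( \frac{f_1}{f_2} \right)^{f_1/2} \frac{z^{f_1/2 - 1}}{\left( 1 + z \frac{f_1}{f_2} \right)^{(f_1 + f_2)/2}}.
\end{aligned}$$

□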

The  distribution  of  F  is  known  as  the  F  distribution.  It  is  named  after  Fisher  [13], 
who  was  the  first  to  study  it. 


7.4.2  Moments  and  Other  Properties 


The mean and higher-order moments of the F distribution can be calculated by making use of the Beta function,

$$B(x, y) = \int_0^\infty \frac{t^{x - 1}}{(1 + t)^{x + y}}\, dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}, \qquad (7.23)$$

to find that

$$\begin{cases} \mu = \dfrac{f_2}{f_2 - 2} & (f_2 > 2) \\ \sigma^2 = \dfrac{2 f_2^2 (f_1 + f_2 - 2)}{f_1 (f_2 - 2)^2 (f_2 - 4)} & (f_2 > 4). \end{cases} \qquad (7.24)$$

The mean is approximately 1, provided that $f_2$ is not too small.
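
The moments in (7.24) can be checked numerically. The following Python sketch is an illustration under stated assumptions: it codes the F density in its standard form, $f_F(z) \propto z^{f_1/2 - 1} (1 + z f_1/f_2)^{-(f_1 + f_2)/2}$ with normalization $\Gamma((f_1 + f_2)/2)/(\Gamma(f_1/2)\Gamma(f_2/2))\,(f_1/f_2)^{f_1/2}$, and integrates it with a composite Simpson rule on a truncated range. The degrees of freedom $f_1 = 5$, $f_2 = 10$, the upper limit, and the helper names are arbitrary choices for this example.

```python
import math

def f_pdf(z, f1, f2):
    """F density, evaluated through logarithms for stability (assumes f1 > 2)."""
    if z <= 0:
        return 0.0
    log_norm = (math.lgamma((f1 + f2) / 2) - math.lgamma(f1 / 2)
                - math.lgamma(f2 / 2) + (f1 / 2) * math.log(f1 / f2))
    return math.exp(log_norm + (f1 / 2 - 1) * math.log(z)
                    - ((f1 + f2) / 2) * math.log1p(z * f1 / f2))

def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3

f1, f2 = 5, 10
upper, steps = 500.0, 100000    # truncated range; the neglected tail is tiny for f2 = 10

norm = simpson(lambda z: f_pdf(z, f1, f2), 0.0, upper, steps)
mean = simpson(lambda z: z * f_pdf(z, f1, f2), 0.0, upper, steps)
second = simpson(lambda z: z * z * f_pdf(z, f1, f2), 0.0, upper, steps)
var = second - mean ** 2

mean_formula = f2 / (f2 - 2)
var_formula = 2 * f2 ** 2 * (f1 + f2 - 2) / (f1 * (f2 - 2) ** 2 * (f2 - 4))

print("normalization:", norm)
print("mean:", mean, "formula:", mean_formula)
print("variance:", var, "formula:", var_formula)
```

For values of $f_2$ close to the limits $f_2 > 2$ or $f_2 > 4$ the tail of the integrand decays slowly, and a truncated integration of this kind would need a much larger upper limit.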

It is possible to find an approximation to the F distribution when either $f_1$ or $f_2$ is a large number:

$$\begin{cases} \lim_{f_2 \to \infty} f_F(z, f_1, f_2) = f_{\chi^2}(x, f_1) & \text{where } x = f_1 z \\ \lim_{f_1 \to \infty} f_F(z, f_1, f_2) = f_{\chi^2}(x, f_2) & \text{where } x = f_2/z. \end{cases} \qquad (7.25)$$

The approximation, discussed, for example, in [1], is very convenient, since it overcomes the problems with the evaluation of the Gamma function for large numbers.


7.4.3  Hypothesis  Testing 

The F statistic is a ratio

$$F = \frac{\chi_1^2/f_1}{\chi_2^2/f_2} \qquad (7.26)$$

between two independent $\chi^2$ measurements of, respectively, $f_1$ and $f_2$ degrees of freedom. A typical application of the F test is the comparison of two $\chi^2$ statistics from independent datasets using the parent Gaussians as models for the data. The null hypothesis is that both sets of measurements follow the respective Gaussian distribution. In this case, the measured ratio F will follow the F distribution. This implies that the measured value of F should not be too large under the null hypothesis that both measurements follow the parent models.

It is customary to do hypothesis testing of an F distribution using a one-sided rejection region above a critical value. The critical value at confidence level $p$ is calculated via

$$P(F > F_{crit}) = \int_{F_{crit}}^{\infty} f_F(z)\, dz = 1 - p. \qquad (7.27)$$


Critical values are tabulated in Table A.8 for the case of fixed $f_1 = 1$, and Tables A.9, A.10, A.11, A.12, A.13, A.14, and A.15 for various values of $p$, and as function of $f_1$ and $f_2$. The values of $F_{crit}$ calculated from (7.27) indicate how high a value of the F statistic can be, and still be consistent with the hypothesis that the two quantities at the numerator and denominator are $\chi^2$-distributed variables.
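
When a table is not at hand, the critical value defined by (7.27) can be found numerically. The sketch below is one simple way to do so and is not the method used to build the tables: it integrates the F density with a Simpson rule to obtain the cumulative probability, and then solves $P(F \le F_{crit}) = p$ by bisection. The function names `f_pdf`, `f_cdf`, and `f_critical` are hypothetical helpers, and the density is the standard F density with $f_1 > 2$ assumed.

```python
import math

def f_pdf(z, f1, f2):
    """F density, evaluated through logarithms (assumes f1 > 2)."""
    if z <= 0:
        return 0.0
    log_norm = (math.lgamma((f1 + f2) / 2) - math.lgamma(f1 / 2)
                - math.lgamma(f2 / 2) + (f1 / 2) * math.log(f1 / f2))
    return math.exp(log_norm + (f1 / 2 - 1) * math.log(z)
                    - ((f1 + f2) / 2) * math.log1p(z * f1 / f2))

def f_cdf(x, f1, f2, n=1000):
    """P(F <= x) by composite Simpson integration of the density on [0, x]."""
    if x <= 0:
        return 0.0
    h = x / n
    total = f_pdf(0.0, f1, f2) + f_pdf(x, f1, f2)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f_pdf(i * h, f1, f2)
    return total * h / 3

def f_critical(p, f1, f2):
    """Value F_crit such that P(F > F_crit) = 1 - p, found by bisection."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, f1, f2) < p:      # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, f1, f2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

crit_75_55 = f_critical(0.75, 5, 5)
crit_90_55 = f_critical(0.90, 5, 5)
crit_75_44 = f_critical(0.75, 4, 4)
crit_90_44 = f_critical(0.90, 4, 4)
print(crit_75_55, crit_90_55)
print(crit_75_44, crit_90_44)
```

The four values can be compared with the entries of the tables for $p = 0.75$ and $p = 0.90$.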

The approximations for the F distribution in (7.25) can be used to calculate critical values when one of the degrees of freedom is very large. For example, the critical value of F at 90 % confidence, $p = 0.90$, for $f_1 = 100$ and $f_2 \to \infty$ (e.g., Table A.13) is calculated from Table A.7 as F = 1.185. Note that Table A.7 reports the value of the reduced $\chi^2$, or $z$ in the notation of the top equation in (7.25).

Example 7.5 Consider the data set composed of the ten measurements

(10, 12, 15, 11, 13, 16, 12, 10, 18, 13).

We assume that the measurements follow a Gaussian distribution of mean $\mu = 13$ and variance $\sigma^2$. The goal is to compare the $\chi^2$ of the first five measurements with that of the last five, to address whether both subsets are equally likely to be described by the same Gaussian.

We obtain $\chi_1^2 = 18/\sigma^2$ and $\chi_2^2 = 44/\sigma^2$, respectively, for the first and the second set of five measurements. Both variables, under the null hypothesis that the measurements follow the reference Gaussian, are distributed like $\chi^2$ with 5 degrees of freedom (since both mean and variance are assumed to be known). We therefore can calculate an F statistic of F = 44/18 = 2.44. For simplicity, we have placed the initial five measurements at the denominator.

In the process of calculating the F statistic, the variances $\sigma^2$ cancel, and therefore the null hypothesis is that of a mean of $\mu = 13$ and the same variance for both sets, regardless of its value. In Fig. 7.3 we plot the F distribution for $f_1 = 5$ and $f_2 = 5$ as the solid line, and its 75 and 90 % rejection regions, marked, respectively, by the critical values F = 1.89 and 3.45, as hatched and cross-hatched areas. The measurements are therefore consistent with the null hypothesis at the 90 % level, but the null hypothesis must be discarded at the 75 % confidence level. Clearly the first set of five numbers follows the parent Gaussian more closely than the second set. Yet, there is a reasonable chance (> 10 %) that both sets follow the Gaussian.

Fig. 7.3 Solid curve is the F distribution with $\nu_1 = 5$, $\nu_2 = 5$ degrees of freedom; the hatched area is the 75 % rejection region, and the cross-hatched area is the 90 % rejection region. For comparison, the F distribution with $\nu_1 = 4$, $\nu_2 = 4$ degrees of freedom is shown as the dashed line, and the two rejection regions are outlined in green and red, respectively. The rejection region for the F distribution with $\nu_1 = 4$, $\nu_2 = 4$ degrees of freedom is shifted to higher values, relative to that with $\nu_1 = 5$, $\nu_2 = 5$ degrees of freedom, because of its heavier tail

If the parent variance was given, say $\sigma^2 = 4$, we could have tested both subsets independently for the hypothesis that they follow a Gaussian of mean $\mu = 13$ and variance $\sigma^2 = 4$ using the $\chi^2$ distribution. The two measurements are $\chi_1^2 = 4.5$ and $\chi_2^2 = 11$ for 5 degrees of freedom. Assuming a confidence level of $p = 0.9$, the critical value of the $\chi^2$ distribution is $\chi^2_{crit} = 9.2$. At this confidence level, we would reject the null hypothesis for the second measurement. ◊

The ratio between two measurements of the sample variance follows the F distribution. For two independent sets of, respectively, $N$ and $M$ measurements, the sample variances $s_1^2$ and $s_2^2$ are related to the parent variances $\sigma_1^2$ and $\sigma_2^2$ of the Gaussian models via

$$F = \frac{Z_1/f_1}{Z_2/f_2} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}, \qquad (7.28)$$

where

$$\begin{cases} S_1^2 = (N - 1)\, s_1^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 \\ S_2^2 = (M - 1)\, s_2^2 = \sum_{j=1}^{M} (y_j - \bar{y})^2. \end{cases} \qquad (7.29)$$

The quantities $Z_1 = S_1^2/\sigma_1^2$ and $Z_2 = S_2^2/\sigma_2^2$ are $\chi^2$-distributed variables with, respectively, $f_1 = N - 1$ and $f_2 = M - 1$ degrees of freedom. The statistic F can be used to test whether both measurements of the variance are equally likely to have come from the respective models.

The interesting case is clearly when the two variances are equal, $\sigma_1^2 = \sigma_2^2$, so that the value of the variance drops out of the equation and the F statistic becomes

$$F = \frac{s_1^2}{s_2^2}. \qquad (7.30)$$

In this case, the null hypothesis becomes that the two samples are Gaussian distributed with the same variance, regardless of the values of the mean and the variance. The statistic therefore measures whether the variances or variability of the data in the two measurements are consistent with one another or if one measurement has a sample variance that is significantly larger than the other. If the value of F exceeds the critical value, then the null hypothesis must be rejected and the conclusion is that the measurement with the largest value of $Z/f$, which is placed at the numerator, is not as likely to have come from the parent model as the other set. This type of analysis will have specific applications to model fitting in Chap. 13.

Example 7.6 Using the same data as in the previous example, we can calculate the sample variance using the sample mean for each of the two 5-measurement sets. We calculate a sample mean of $\bar{x}_1 = 12.2$ and $\bar{x}_2 = 13.8$, for a value of $S_1^2 = 14.8$ and $S_2^2 = 40.8$, for a ratio of F = 2.76. Given that the sample mean was estimated from the data, the null hypothesis is that both sets are drawn from the same Gaussian distribution, without specification of the value of either variance or mean, and each measurement of $S^2/\sigma^2$ is distributed now like a $\chi^2$ variable with just 4 degrees of freedom (and not 5). The value of the F statistic must therefore be compared with an F distribution with $f_1 = 4$ and $f_2 = 4$ degrees of freedom, reported in Fig. 7.3 as a dashed line. The 75 and 90 % rejection regions, marked, respectively, by the critical values F = 2.06 and 4.1, are outlined in green and red, respectively. The measurements are therefore consistent at the 90 % confidence level, but not at the 75 % level.

We conclude that there is at least a 10 % probability that the two measurements of the variance are consistent with one another. At the $p = 0.9$ level we therefore cannot reject the null hypothesis. ◊
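
The two F statistics of Examples 7.5 and 7.6 differ only in the reference point of the squared deviations and in the number of degrees of freedom. The short Python sketch below recomputes both from the ten measurements and compares them with the critical values quoted in the text; the helper `sum_sq` and the variable names are introduced only for this illustration.

```python
data = [10, 12, 15, 11, 13, 16, 12, 10, 18, 13]
first, second = data[:5], data[5:]

def sum_sq(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)

# Example 7.5: parent mean assumed known, 5 degrees of freedom for each subset
mu = 13
F_known_mean = (sum_sq(second, mu) / 5) / (sum_sq(first, mu) / 5)

# Example 7.6: mean estimated from each subset, 4 degrees of freedom for each
m1 = sum(first) / len(first)
m2 = sum(second) / len(second)
S1 = sum_sq(first, m1)
S2 = sum_sq(second, m2)
F_sample_mean = (S2 / 4) / (S1 / 4)

# critical values (75 % and 90 % confidence) as quoted in the two examples
crit_55 = (1.89, 3.45)
crit_44 = (2.06, 4.1)

for label, F, (c75, c90) in (("f1 = f2 = 5", F_known_mean, crit_55),
                             ("f1 = f2 = 4", F_sample_mean, crit_44)):
    print(label, "F =", round(F, 2),
          "| exceeds 75 % critical value:", F > c75,
          "| exceeds 90 % critical value:", F > c90)
```

In both cases the larger sum of squares is placed at the numerator, following the convention adopted in the examples.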


7.5  The  Sampling  Distribution  of  the  Mean 
and  the  Student’s  t  Distribution 

In  many  experimental  situations  we  want  to  compare  the  sample  mean  obtained 
from  the  data  to  a  parent  mean  based  on  theoretical  considerations.  Other  times 
we  want  to  compare  two  sample  means  to  one  another.  The  question  we  answer 
in  this  section  is  how  the  sample  mean  is  expected  to  vary  when  estimated  from 
independent  samples  of  size  N. 


7.5.1  Comparison  of  Sample  Mean  with  Parent  Mean 

For measurements of a Gaussian variable of mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{x}$ is distributed as a Gaussian of mean $\mu$ and variance $\sigma^2/N$. Therefore, if both the mean and the variance of the parent distribution are known, the sample mean $\bar{x}$ is such that

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{N}} \sim N(0, 1).$$

A simple comparison of the z-score of the sample mean with the $N(0, 1)$ Gaussian therefore addresses the consistency between the measurement and the model.

Example 7.7 Continue with the example of the five measurements of a random variable (10, 12, 15, 11, 13), assumed to be distributed like a Gaussian of $\mu = 13$ and $\sigma^2 = 2$. Assuming knowledge of the parent mean and variance, the z-score of the sample mean is

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{N}} = \frac{12.2 - 13}{\sqrt{2/5}} = -1.27.$$

According to Table A.2, there is a probability of about 20 % to exceed the absolute value of this measurement according to the parent distribution $N(0, 1)$. Therefore the null hypothesis that the measurements are distributed like a Gaussian of $\mu = 13$ and $\sigma^2 = 2$ cannot be rejected at the 90 % confidence level. Notice that this is the same probability as obtained by using the sum of the five measurements, instead of the average. This was to be expected, since the mean differs from the sum by a constant factor, and therefore the two statistics are equivalent. ◊
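
The z-score and the associated two-sided probability can be evaluated directly with the complementary error function of the Python standard library, using the identity $P(|Z| > z) = \mathrm{erfc}(z/\sqrt{2})$ for a standard Gaussian. The variable names below are chosen only for this sketch.

```python
import math

x = [10, 12, 15, 11, 13]
mu, sigma2 = 13.0, 2.0          # parent mean and variance assumed known
N = len(x)

xbar = sum(x) / N
z = (xbar - mu) / math.sqrt(sigma2 / N)          # z-score of the sample mean

# probability that a standard Gaussian exceeds |z| in absolute value
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print("sample mean:", xbar)
print("z-score:", z)
print("P(|Z| > |z|):", p_two_sided)
```

A probability larger than 0.10 means that the null hypothesis cannot be rejected at the 90 % confidence level.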

A more common situation is when the mean $\mu$ of the parent distribution is known but the parent variance is unknown. In those cases the parent variance can only be estimated from the data themselves via the sample variance $s^2$ and one needs to allow for such uncertainty when estimating the distribution of the sample mean. This additional uncertainty leads to a deviation of the distribution function from the simple Gaussian shape. We therefore seek to find the distribution of

$$T = \frac{\bar{x} - \mu}{s/\sqrt{N}} \qquad (7.31)$$

in which we define the sample variance in such a way that it is an unbiased estimator of the parent variance,

$$s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{S^2}{N - 1}.$$


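The variable T can be written as

$$T = \frac{\bar{x} - \mu}{s/\sqrt{N}} = \left( \frac{\bar{x} - \mu}{\sigma/\sqrt{N}} \right) \Big/ \left( \frac{s}{\sigma} \right) = \left( \frac{\bar{x} - \mu}{\sigma/\sqrt{N}} \right) \Big/ \left[ \frac{S^2}{(N - 1)\, \sigma^2} \right]^{1/2}, \qquad (7.32)$$

in which $S^2$ is the sum of the squares of the deviations from the sample mean. As shown in previous sections,

$$\begin{cases} \dfrac{\bar{x} - \mu}{\sigma/\sqrt{N}} \sim N(0, 1) \\ \dfrac{S^2}{\sigma^2} \sim \chi^2(N - 1). \end{cases}$$

We therefore need to determine the distribution function of the ratio of these two variables. We will show that a random variable T defined by the ratio

$$T = \frac{X}{\sqrt{Z/f}}, \qquad (7.33)$$

in which $X \sim N(0, 1)$ and $Z \sim \chi^2(f)$ (a $\chi^2$ distribution with $f$ degrees of freedom) are independent, is distributed like a t distribution with $f$ degrees of freedom:

$$f_T(t) = \frac{1}{\sqrt{f \pi}} \frac{\Gamma((f + 1)/2)}{\Gamma(f/2)} \left( 1 + \frac{t^2}{f} \right)^{-\frac{f + 1}{2}}. \qquad (7.34)$$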


Proof The proof of (7.34) follows the same method as that of the F distribution. First, we can derive the distribution function of $Y = \sqrt{Z/f}$ using the usual method of change of variables,

$$g(y) = h(z) \frac{dz}{dy} = h(z)\, 2 \sqrt{f z},$$

where

$$h(z) = \frac{z^{f/2 - 1} e^{-z/2}}{2^{f/2}\, \Gamma(f/2)}.$$

Therefore the distribution of Y is given by substituting $z = f y^2$ into the first equation,

$$g(y) = \frac{f^{f/2}\, y^{f - 1}\, e^{-f y^2/2}}{2^{f/2 - 1}\, \Gamma(f/2)}. \qquad (7.35)$$

The distribution function of the numerator of (7.33) is simply

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2},$$

and therefore the distribution of T is given by applying (4.18),

$$f_T(t) = \int_0^\infty \frac{1}{\sqrt{2\pi}} e^{-(t y)^2/2}\; y\; \frac{f^{f/2}\, y^{f - 1}\, e^{-f y^2/2}}{2^{f/2 - 1}\, \Gamma(f/2)}\, dy. \qquad (7.36)$$

The integral can be shown to be equal to (7.34) following a few steps of integration as in the case of the F distribution. □

This distribution is symmetric and has a mean of zero, and it goes under the name of Student's t distribution. This distribution was studied first by Gosset in 1908 [18], who published a paper on the subject under the pseudonym of "Student."

The random variable T defined in (7.31) therefore is distributed like a t variable with $N - 1$ degrees of freedom. It is important to notice the difference between the sample distribution of the mean in the case in which the variance is known, which is $N(0, 1)$, and the t distribution. In particular, the latter depends on the number of measurements, while the former does not. One expects that, in the limit of a large number of measurements, the t distribution tends to the standard normal (see Problem 7.10); in the limit of an infinite number of degrees of freedom, the two distributions are identical. An example of the comparison between the two distributions is shown in Fig. 7.4. The t distribution has heavier tails than the Gaussian distribution, indicative of the additional uncertainty associated with the fact that the variance is estimated from the data and not known a priori.
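
The behavior of the tails can be made concrete with a short numerical comparison. The sketch below codes the Student's t density in its standard form, $f_T(t) = \Gamma((f+1)/2)/(\sqrt{f\pi}\,\Gamma(f/2))\,(1 + t^2/f)^{-(f+1)/2}$, and compares it with the standard Gaussian density at a point in the tail for an increasing number of degrees of freedom. The function names and the chosen values of $f$ and $t$ are assumptions made only for illustration.

```python
import math

def t_pdf(t, f):
    """Student's t density with f degrees of freedom, evaluated through logarithms."""
    log_norm = (math.lgamma((f + 1) / 2) - math.lgamma(f / 2)
                - 0.5 * math.log(f * math.pi))
    return math.exp(log_norm - (f + 1) / 2 * math.log1p(t * t / f))

def gauss_pdf(t):
    """Standard Gaussian density N(0, 1)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

t = 3.0   # a point in the tail of both distributions
for f in (1, 4, 30, 1000):
    ratio = t_pdf(t, f) / gauss_pdf(t)
    print("f =", f, " t density / Gaussian density at t = 3:", round(ratio, 3))
```

The ratio is larger than one for a small number of degrees of freedom and approaches one as $f$ increases, in line with the limit discussed above.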


Fig. 7.4 Student's t distribution with $f = 4$ degrees of freedom, plotted as a function of the statistic T. The dashed curve is the $N(0, 1)$ Gaussian, to which the t distribution tends for a large number of degrees of freedom. The hatched area is the 68 % rejection region (compare to the $\pm 1\sigma$ region for the $N(0, 1)$ distribution) and the cross-hatched area is the 95 % region (compare to $\pm 1.96\sigma$ for the $N(0, 1)$ distribution)
7.5.1.1 Hypothesis Testing

Hypothesis testing with the t distribution typically uses a two-sided rejection region. After obtaining a measurement of the t variable from a given dataset, we are usually interested in knowing how far the measurement can be from the expected mean of 0 and still be consistent with the parent distribution. The critical value for a confidence level $p$ is calculated via

$$\int_{-T_{crit}}^{T_{crit}} f_T(t)\, dt = p \qquad (7.37)$$

and it is a function of the number of degrees of freedom for the t distribution. Tables A.16, A.17, A.18, A.19, A.20, A.21, and A.22 report the value of $p$ as a function of the critical value $T_{crit}$ for selected degrees of freedom, and Table A.23 compares the t distribution with the standard Gaussian.

Example 7.8 Assume now that the five measurements (10, 12, 15, 11, 13) are distributed like a Gaussian of $\mu = 13$, but without reference to a parent variance. In this case we consider the t statistic and start by calculating the sample variance:

$$s^2 = \frac{1}{4} \sum_{i=1}^{5} (x_i - \bar{x})^2 = 3.7.$$

With this we can now calculate the t statistic,

$$t = \frac{\bar{x} - \mu}{s/\sqrt{5}} = \frac{12.2 - 13}{1.92/\sqrt{5}} = -0.93.$$

This value of t corresponds to a probability of approximately 40 % to exceed the absolute value of this measurement, using the t distribution with 4 degrees of freedom of Table A.23. It is clear that the estimation of the variance from the data has added a source of uncertainty in the comparison of the measurement with the parent distribution. ◊
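
The probability quoted in this example can be cross-checked without tables by integrating the t density over the interval $(-|t|, |t|)$, as in (7.37). The Python sketch below does this with a composite Simpson rule; `t_pdf` and `simpson` are helper names assumed for this illustration, and the density is the standard Student's t density.

```python
import math

def t_pdf(t, f):
    """Student's t density with f degrees of freedom."""
    log_norm = (math.lgamma((f + 1) / 2) - math.lgamma(f / 2)
                - 0.5 * math.log(f * math.pi))
    return math.exp(log_norm - (f + 1) / 2 * math.log1p(t * t / f))

def simpson(func, a, b, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3

x = [10, 12, 15, 11, 13]
mu = 13.0                                   # parent mean under test
N = len(x)
xbar = sum(x) / N
s2 = sum((xi - xbar) ** 2 for xi in x) / (N - 1)    # unbiased sample variance
t_value = (xbar - mu) / math.sqrt(s2 / N)
dof = N - 1

# probability enclosed within (-|t|, |t|), and its complement
inside = simpson(lambda t: t_pdf(t, dof), -abs(t_value), abs(t_value))
p_exceed = 1.0 - inside

print("s2:", s2, "t:", t_value)
print("P(|T| > |t|):", p_exceed)
```

The same two functions can be reused for any number of degrees of freedom by changing `dof`.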


7.5.2  Comparison  of  Two  Sample  Means  and  Hypothesis 
Testing 

The same distribution function is also applicable to the comparison between two sample means $\bar{x}_1$ and $\bar{x}_2$, derived from samples of size $N_1$ and $N_2$, respectively. In this case, we define the following statistic:

$$T = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/N_1 + 1/N_2}} \qquad (7.38)$$

where

$$\begin{cases} s^2 = \dfrac{S_1^2 + S_2^2}{N_1 + N_2 - 2} \\ S_1^2 = \sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 \\ S_2^2 = \sum_{j=1}^{N_2} (x_j - \bar{x}_2)^2. \end{cases}$$

We show that this statistic is distributed like a t distribution with $f = N_1 + N_2 - 2$ degrees of freedom, and therefore we can use the same distribution also for testing the agreement between two sample means.

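Proof Under the hypothesis that all measurements are drawn from the same parent distribution, $X \sim N(\mu, \sigma)$, we know that

$$\begin{cases} \dfrac{\bar{x}_1 - \mu}{\sigma/\sqrt{N_1}} \sim N(0, 1) \\ \dfrac{\bar{x}_2 - \mu}{\sigma/\sqrt{N_2}} \sim N(0, 1) \end{cases}$$

and, from (7.20),

$$\begin{cases} \dfrac{S_1^2}{\sigma^2} \sim \chi^2(N_1 - 1) \\ \dfrac{S_2^2}{\sigma^2} \sim \chi^2(N_2 - 1). \end{cases}$$

First, we find the distribution function for the variable $(\bar{x}_1 - \mu)/\sigma - (\bar{x}_2 - \mu)/\sigma$. Assuming that the measurements are independent, the variable is a Gaussian with zero mean, with variances added in quadrature, therefore

$$X = \left( \frac{\bar{x}_1 - \mu}{\sigma} - \frac{\bar{x}_2 - \mu}{\sigma} \right) \Big/ \sqrt{\frac{1}{N_1} + \frac{1}{N_2}} \sim N(0, 1).$$

Next, since the sum of independent $\chi^2$ variables is also distributed like a $\chi^2$ distribution with a number of degrees of freedom equal to the sum of the individual degrees of freedom,

$$Z = \frac{S_1^2}{\sigma^2} + \frac{S_2^2}{\sigma^2} \sim \chi^2(N_1 + N_2 - 2).$$

We also know the distribution of $\sqrt{Z/f}$ from (7.35), with $f = N_1 + N_2 - 2$ the number of degrees of freedom for both datasets combined. As a result, the variable T can be written as

$$T = \frac{X}{\sqrt{Z/f}} = \frac{(\bar{x}_1 - \bar{x}_2)/\sigma}{\sqrt{1/N_1 + 1/N_2}} \Big/ \sqrt{\frac{S_1^2 + S_2^2}{\sigma^2 f}} = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/N_1 + 1/N_2}} \qquad (7.39)$$

in a form that is identical to the T function for comparison of the sample mean with the parent mean, and therefore we can conclude that the random variable defined in (7.38) is in fact a t variable with $f = N_1 + N_2 - 2$ degrees of freedom. □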

Example 7.9 Using the ten measurements (10, 12, 15, 11, 13, 16, 12, 10, 18, 13), we have already calculated the sample mean of the first and second half of the measurements as $\bar{x}_1 = 12.2$ and $\bar{x}_2 = 13.8$, and the sums of the squared deviations as $S_1^2 = 14.8$ and $S_2^2 = 40.8$, so that $s^2 = (14.8 + 40.8)/8 = 6.95$. This results in a measurement of the t statistic for the comparison between two means of

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/N_1 + 1/N_2}} = \frac{12.2 - 13.8}{\sqrt{6.95}\, \sqrt{1/5 + 1/5}} = -0.96. \qquad (7.40)$$

This number is to be compared with a t distribution with 8 degrees of freedom, and we conclude that the measurement is consistent, at any reasonable level of confidence, with the parent distribution. In this case, we are making a statement regarding the fact that the two sets of measurements may have the same mean, but without committing to a specific value. ◊
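
The statistic (7.38) is simple to evaluate from the raw measurements. The following Python sketch recomputes it for the two halves of the dataset, with the pooled estimate $s^2 = (S_1^2 + S_2^2)/(N_1 + N_2 - 2)$; the variable names are chosen only for this illustration.

```python
import math

data = [10, 12, 15, 11, 13, 16, 12, 10, 18, 13]
first, second = data[:5], data[5:]
N1, N2 = len(first), len(second)

m1 = sum(first) / N1
m2 = sum(second) / N2
S1 = sum((v - m1) ** 2 for v in first)      # squared deviations from each sample mean
S2 = sum((v - m2) ** 2 for v in second)

dof = N1 + N2 - 2
s = math.sqrt((S1 + S2) / dof)              # pooled estimate of the standard deviation
t_value = (m1 - m2) / (s * math.sqrt(1 / N1 + 1 / N2))

print("sample means:", m1, m2)
print("pooled s^2:", s ** 2, "degrees of freedom:", dof)
print("t statistic:", t_value)
```

The resulting value is then compared with a t distribution with `dof` degrees of freedom, for example through the tables referenced in Sect. 7.5.1.1.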
Summary of Key Concepts for this Chapter

□ Hypothesis Testing: A four-step process that consists of (1) defining a null hypothesis to test, (2) determining the relevant statistic (e.g., $\chi^2$), (3) choosing a confidence level (e.g., 90 %), and (4) deciding whether the null hypothesis is discarded or not.

□ $\chi^2$ distribution: The theoretical distribution of the sum of the squares of $N$ independent z-scores,

$$\chi^2 = \sum_{i=1}^{N} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2$$

(mean $N$ and variance $2N$).

□ Sampling distribution of variance: Distribution of sample variance $s^2 = S^2/(N - 1)$,

$$S^2/\sigma^2 \sim \chi^2(N - 1)$$

□ F Statistic: Distribution of the ratio of independent $\chi^2$ variables

$$F = \frac{\chi_1^2/f_1}{\chi_2^2/f_2}$$

(mean $f_2/(f_2 - 2)$ for $f_2 > 2$), also used to test for additional model components.

□ Student's t distribution: Distribution for the variable

$$T = \frac{\bar{x} - \mu}{s/\sqrt{n}},$$

useful to compare the sample mean to the parent mean when the variance is estimated from the data.


Problems 

7.1 Five students score 70, 75, 65, 70, and 65 on a test. Determine whether the scores are compatible with the following hypotheses:

(a) The mean is $\mu = 75$;

(b) the mean is $\mu = 75$ and the standard deviation is $\sigma = 5$.

Test both hypotheses at the 95 % or 68 % confidence levels, assuming that the scores are Gaussian distributed.
7.2 Prove that the mean and variance of the F distribution are given by the following relationships,

$$\begin{cases} \mu = \dfrac{f_2}{f_2 - 2} \\ \sigma^2 = \dfrac{2 f_2^2 (f_1 + f_2 - 2)}{f_1 (f_2 - 2)^2 (f_2 - 4)}, \end{cases}$$

where $f_1$ and $f_2$ are the degrees of freedom of the variables at the numerator and denominator, respectively.

7.3 Using the same data as Problem 7.1, test whether the sample variance is consistent with a parent variance of $\sigma^2 = 25$, at the 95 % level.

7.4  Using  the  J.J.  Thomson  experiment  data  of  page  23,  measure  the  ratio  of 
the  sample  variances  of  the  m/e  measurements  in  Air  for  Tube  1  and  Tube  2. 
Determine  if  the  null  hypothesis  that  the  two  measurements  are  drawn  from  the 
same  distribution  can  be  rejected  at  the  90  %  confidence  level.  State  all  assumptions 
required  to  use  the  F  distribution. 

7.5  Consider  a  dataset  (10, 12, 15, 11, 13, 16, 12, 10, 18, 13),  and  calculate  the  ratio 
of  the  sample  variance  of  the  first  two  measurements  with  that  of  the  last  eight.  In 
particular,  determine  at  what  confidence  level  for  the  null  hypothesis  both  subsets 
are  consistent  with  the  same  variance. 

7.6 Six measurements of the length of a wooden block gave the following
results: 20.3, 20.4, 19.8, 20.4, 19.9, and 20.7 cm.

(a) Estimate the mean and the standard error of the length of the block;

(b) Assume that the block is known to be of length $\mu = 20$ cm. Establish if the
measurements are consistent with the known length of the block, at the 90 %
probability level.

7.7  Consider  Mendel’s  experimental  data  in  Table  1.1  shown  at  page  9. 

(a)  Consider  the  data  that  pertain  to  the  case  of  “Long  vs.  short  stem.”  Write 
an  expression  for  the  probability  of  making  that  measurement,  assuming 
Mendel’s  hypothesis  of  independent  assortment.  You  do  not  need  to  evaluate 
the  expression. 

(b)  Using  the  distribution  function  that  pertains  to  that  measurement,  determine  the 
mean  and  variance  of  the  parent  distribution.  Using  the  Gaussian  approximation 
for  this  distribution,  determine  if  the  null  hypothesis  that  the  measurement  is 
drawn  from  the  parent  distribution  is  compatible  with  the  data  at  the  68% 
confidence level.

7.8 Consider Mendel’s experimental data in Table 1.1 shown at page 9. Considering
all  seven  measurements,  calculate  the  probability  that  the  mean  fraction  of  dominant 
characters  agrees  with  the  expectation  of  0.75.  For  this  purpose,  you  may  use  the  t 
statistic. 

7.9  Starting  with  (7.36),  complete  the  derivation  of  (7.34). 

7.10 Show that the t distribution,

$$
f_T(t) = \frac{1}{\sqrt{f\pi}} \, \frac{\Gamma((f+1)/2)}{\Gamma(f/2)} \left(1 + \frac{t^2}{f}\right)^{-\frac{1}{2}(f+1)},
$$

becomes a standard Gaussian in the limit of large $f$. You can make use of the
asymptotic expansion of the Gamma function (A.17).


Chapter  8 

Maximum  Likelihood  Methods  for  Two- Variable 
Datasets 


Abstract One of the most common tasks in the analysis of scientific data is
to establish a relationship between two quantities. Many experiments feature the
measurement of a quantity of interest as a function of another control quantity that
is varied as the experiment is performed. In this chapter we use the maximum
likelihood method to determine whether a certain relationship between the two
quantities is consistent with the available measurements, and to find the best-fit
parameters of the relationship. The method has a simple analytic solution for a
linear function but can also be applied to more complex analytic functions.


8.1  Measurement  of  Pairs  of  Variables 

A general problem in data analysis is to establish a relationship $y = y(x)$ between
two random variables X and Y for which we have available a set of N measurements
$(x_i, y_i)$. The random variable X is considered to be the independent variable and
it will be treated as having uncertainties that are much smaller than those in the
dependent variable, i.e., $\sigma_x \ll \sigma_y$. This may not always be the case and there are
some instances in which both errors need to be considered. The case of datasets
with errors in both variables is presented in Chap. 12.

The starting point of the analysis of a two-dimensional dataset is an analytic
form for $y(x)$, e.g., $y(x) = a + bx$. The function $y(x)$ has a given number of
adjustable parameters $a_k$, $k = 1, \ldots, m$, that are to be constrained according to the
measurements. When the independent variable X is assumed to be known exactly,
then the two-variable dataset can be described as a sequence of random variables
$Y(x_i)$. For these variables we typically have a measurement of the standard error
such that the two-variable data are of the form

$$
(x_i, y_i \pm \sigma_i), \quad i = 1, \ldots, N.
$$

An example of this situation may be a dataset in which the size of an object is
measured at different time intervals. In this example the time of measurement
$t_i$ is the independent variable, assumed to be known exactly, and $r_i \pm \sigma_i$ is the
measurement of the size at that time interval. Although we call $y(t_i) = r_i \pm \sigma_i$ a
“measurement,” it really may itself be obtained from a number of measurements
from which one infers the mean and the variance of that random variable, as
described in the earlier chapters. It is therefore reasonable to expect that the
measurement provides also an estimate of the standard error.

Before describing the mathematical properties of the method used to estimate the
best-fit parameters we need to understand the framework for the analysis. Consider
as an example the case of a linear function between X and Y illustrated in Fig. 8.1.
The main assumption of the method is that the function $y = y(x)$ is the correct
description of the relationship between the two variables. This means that each
random variable $y(x_i)$ is a Gaussian with the following parameters:

$$
\begin{cases}
\mu_i = y(x_i) & \text{the parent mean is determined by } y(x) \\
\sigma_i^2 & \text{the variance is estimated from the data.}
\end{cases}
$$

Notice how this framework is somewhat of a hybrid: the parent mean is determined
by the parent model $y(x)$ while the variance is estimated from the data. It should
not be viewed as a surprise that the model $y = y(x)$ typically cannot determine by
itself the variance of the variable. In fact, we know that the variance depends on the
quality of the measurements made and therefore it is reasonable to expect that $\sigma_i$
is estimated from the data themselves. In Sect. 8.2 we will use the assumption that
Y has a Gaussian distribution, but this need not be the only possibility. In fact, in
Sect. 8.8 we will show how data can be fit in alternative cases, such as when the
variable has a Poisson distribution.


Fig. 8.1 In the fit of two-variable data to a linear function, measurements of the dependent variable
Y are made for a few selected points of the variable X (in this example $x_1 = 1$, $x_2 = 3$, $x_3 = 5$ and
$x_4 = 7$). Each datapoint is marked by the circle with error bars. The independent variable X is
assumed to be known exactly and the size of the error bar determines the value of the variance
of $Y(x_i)$

In many cases the variables $Y(x_i)$ have a Gaussian distribution, as illustrated in
Fig. 8.1. The data are represented by points with an error bar and the model for each
data point is a Gaussian centered at the value of the parent model $y(x_i)$. The model
$y(x)$ can be any function and, as described in the previous section, the standard
deviation $\sigma_i$ is estimated from the data themselves.

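The goal of fitting data to a model is twofold: to determine whether the model
$y(x)$ is an accurate representation of the data and, at the same time, to determine
what values of the adjustable parameters are compatible with the data. The two
goals are necessarily addressed together. The starting point is the calculation of the
likelihood $\mathcal{L}$ of the data with the model as

$$
\mathcal{L} = P(\text{data}/\text{model})
= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - y(x_i))^2}{2\sigma_i^2}\right)
= \left(\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}}\right) \exp\left(-\sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{2\sigma_i^2}\right).
\tag{8.2}
$$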


In the previous equation we have assumed that the measurements $y_i \pm \sigma_i$ are
independent of one another, so that the Gaussian probabilities can be simply multiplied.
Independence between measurements is a critical assumption in the use of the
maximum likelihood method.

The  core  of  the  maximum  likelihood  method  is  the  requirement  that  the  unknown 
parameters  of  the  model  y  =  y(x)  are  those  that  maximize  the  likelihood  of  the 
data.  This  is  the  same  logic  used  in  the  estimate  of  parameters  for  a  single  variable 
presented  in  Chap.  5.  The  method  of  maximum  likelihood  results  in  the  condition 
that  the  following  function  has  to  be  minimized: 


$$
\chi^2 = \sum_{i=1}^{N} \left(\frac{y_i - y(x_i)}{\sigma_i}\right)^2.
\tag{8.3}
$$


In  fact,  the  factor  in  (8.2)  containing  the  product  of  the  sample  variances  is  constant 
with  respect  to  the  adjustable  parameters  and  maximization  of  the  likelihood  is 
obtained  by  minimization  of  the  exponential  term. 

Equation (8.3) defines the goodness of fit statistic $\chi^2$, which bears its name
from the fact that it is distributed like a $\chi^2$ variable. The number of degrees of
freedom associated with this variable depends on the number of free parameters
of the model $y(x)$, as will be explained in detail in Chap. 10. The simplest case
is that of a model that has no free parameters. In that case, we know already that
the minimum $\chi^2$ has exactly N degrees of freedom. Given the form of (8.3), the
maximum likelihood method, when applied to Gaussian distributions, is also known
as the least squares method.
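
The link between the likelihood of (8.2) and the $\chi^2$ of (8.3) can be checked numerically. The following is a minimal Python sketch and not part of the original text: the function names `chi_square` and `gaussian_log_likelihood`, the data values, and the two trial models are illustrative assumptions.

```python
import math

def chi_square(x, y, sigma, model):
    """Chi-square statistic of (8.3) for a model function model(x)."""
    return sum(((yi - model(xi)) / si) ** 2 for xi, yi, si in zip(x, y, sigma))

def gaussian_log_likelihood(x, y, sigma, model):
    """Logarithm of the likelihood (8.2) for independent Gaussian measurements."""
    # Normalization term: depends on the errors only, not on the model.
    norm = -sum(0.5 * math.log(2.0 * math.pi * si ** 2) for si in sigma)
    return norm - 0.5 * chi_square(x, y, sigma, model)

# Made-up measurements at the x values used in Fig. 8.1 (assumed for illustration).
x = [1.0, 3.0, 5.0, 7.0]
y = [1.2, 2.9, 5.3, 6.8]
sigma = [0.5, 0.5, 0.5, 0.5]

def line_1(t):
    return t          # trial model y = x

def line_2(t):
    return 1.0 + t    # trial model y = 1 + x

chi2_1 = chi_square(x, y, sigma, line_1)
chi2_2 = chi_square(x, y, sigma, line_2)
delta_lnL = (gaussian_log_likelihood(x, y, sigma, line_1)
             - gaussian_log_likelihood(x, y, sigma, line_2))
print(chi2_1, chi2_2, delta_lnL)
```

Because the normalization factor is the same for both models, the difference in $\ln \mathcal{L}$ equals $-1/2$ times the difference in $\chi^2$, so the model with the smaller $\chi^2$ is also the one with the larger likelihood.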


8.3  Least-Squares  Fit  to  a  Straight  Line, 
or  Linear  Regression 

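When the fitting function is

$$
y(x) = a + bx \tag{8.4}
$$

the problem of minimizing the $\chi^2$ defined in (8.3) can be solved analytically. The
conditions of minimum $\chi^2$ are written as partial derivatives with respect to the two
unknown parameters:

$$
\begin{cases}
\dfrac{\partial}{\partial a}\chi^2 = -2 \sum \dfrac{1}{\sigma_i^2} (y_i - a - b x_i) = 0 \\[2ex]
\dfrac{\partial}{\partial b}\chi^2 = -2 \sum \dfrac{x_i}{\sigma_i^2} (y_i - a - b x_i) = 0
\end{cases}
\tag{8.5}
$$

$$
\Rightarrow
\begin{cases}
\sum \dfrac{y_i}{\sigma_i^2} = a \sum \dfrac{1}{\sigma_i^2} + b \sum \dfrac{x_i}{\sigma_i^2} \\[2ex]
\sum \dfrac{x_i y_i}{\sigma_i^2} = a \sum \dfrac{x_i}{\sigma_i^2} + b \sum \dfrac{x_i^2}{\sigma_i^2}
\end{cases}
\tag{8.6}
$$

which is a system of two equations in two unknowns. The solution is

$$
\begin{cases}
a = \dfrac{1}{\Delta} \left( \sum \dfrac{y_i}{\sigma_i^2} \sum \dfrac{x_i^2}{\sigma_i^2} - \sum \dfrac{x_i}{\sigma_i^2} \sum \dfrac{x_i y_i}{\sigma_i^2} \right) \\[2ex]
b = \dfrac{1}{\Delta} \left( \sum \dfrac{1}{\sigma_i^2} \sum \dfrac{x_i y_i}{\sigma_i^2} - \sum \dfrac{y_i}{\sigma_i^2} \sum \dfrac{x_i}{\sigma_i^2} \right)
\end{cases}
\tag{8.7}
$$

where

$$
\Delta = \sum \frac{1}{\sigma_i^2} \sum \frac{x_i^2}{\sigma_i^2} - \left( \sum \frac{x_i}{\sigma_i^2} \right)^2.
\tag{8.8}
$$

Equation (8.7) provides the solution for the best-fit parameters of the linear model.
The determination of the parameters of the linear model is known as linear
regression.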
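
The closed-form solution of (8.7) and (8.8) translates directly into a few weighted sums. The sketch below is an illustration rather than a reference implementation: the helper names `weighted_sums` and `linear_fit` and the synthetic dataset (points lying exactly on $y = 2 + 3x$, with arbitrary unequal errors) are assumptions made for this example.

```python
def weighted_sums(x, y, sigma):
    """Return the sums over 1/sigma_i^2, x_i/sigma_i^2, y_i/sigma_i^2,
    x_i^2/sigma_i^2 and x_i*y_i/sigma_i^2 that appear in (8.6)."""
    w = [1.0 / s ** 2 for s in sigma]
    s0 = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    return s0, sx, sy, sxx, sxy

def linear_fit(x, y, sigma):
    """Best-fit intercept a and slope b of y = a + b x, following (8.7)-(8.8)."""
    s0, sx, sy, sxx, sxy = weighted_sums(x, y, sigma)
    delta = s0 * sxx - sx ** 2            # determinant, (8.8)
    a = (sy * sxx - sx * sxy) / delta     # intercept, (8.7)
    b = (s0 * sxy - sy * sx) / delta      # slope, (8.7)
    return a, b

# Synthetic, noiseless data on the line y = 2 + 3x (assumed for illustration).
x = [1.0, 3.0, 5.0, 7.0]
y = [5.0, 11.0, 17.0, 23.0]
sigma = [0.5, 1.0, 0.8, 1.5]

a, b = linear_fit(x, y, sigma)
print(a, b)
```

Since the synthetic points lie exactly on a line, the fit recovers the intercept and slope regardless of the individual errors; with real, scattered data the errors act as weights that favor the more precise measurements.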

When all errors are identical, $\sigma_i = \sigma$, it is easy to show that the best-fit
parameters estimated by the least-squares method are equivalent to

$$
\begin{cases}
b = \dfrac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} \\[2ex]
a = E(Y) - b\,E(X)
\end{cases}
\tag{8.9}
$$

[see Problem (8.9)]. This means that, in the absence of correlation between the
two variables, the best-fit slope will be zero and the value of a is simply the linear
average of the measurements.


8.4  Multiple  Linear  Regression 

The method outlined above in Sect. 8.3 can be generalized to a fitting function of
the form

$$
y(x) = \sum_{k=1}^{m} a_k f_k(x). \tag{8.10}
$$

Equation (8.10) describes a function that is linear in the m parameters. In this
case one speaks of multiple linear regression, or simply multiple regression. The
functions $f_k(x)$ can have any analytical form. The linear regression described in the
previous section has only two such functions, $f_1(x) = 1$ and $f_2(x) = x$. A common
case is when the functions are polynomials,

$$
f_k(x) = x^k. \tag{8.11}
$$

The important feature to notice is that the functions $f_k(x)$ do not depend on the
parameters $a_k$.

We want to find an analytic solution to the minimization of the $\chi^2$ with the
fitting function in the form of (8.10). As we have seen, this includes the simple
linear regression as a special case. In the process of minimization we will also
determine the variance and the covariances on the fitted parameters since no
fitting is complete without an estimate of the errors and of the correlation between
the coefficients. As a special case we will therefore also find the variances and
covariance between the fit parameters a and b for the linear regression.

8.4.1 Best-Fit Parameters for Multiple Regression


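Minimization of $\chi^2$ with respect to the m parameters $a_k$ is obtained by taking partial
derivatives with respect to the m unknown parameters $a_k$.

This yields the following m equations:

$$
\frac{\partial}{\partial a_l} \sum_{i=1}^{N} \frac{\left(y_i - \sum_{k=1}^{m} a_k f_k(x_i)\right)^2}{\sigma_i^2} = 0
$$

or

$$
-2 \sum_{i=1}^{N} \frac{y_i - \sum_{k=1}^{m} a_k f_k(x_i)}{\sigma_i^2} \, f_l(x_i) = 0.
$$

These equations can be written as

$$
\sum_{i=1}^{N} \frac{f_l(x_i)}{\sigma_i^2} \left( y_i - \sum_{k=1}^{m} a_k f_k(x_i) \right) = 0
\tag{8.12}
$$

leading to

$$
\sum_{i=1}^{N} \frac{f_l(x_i)\, y_i}{\sigma_i^2} = \sum_{k=1}^{m} a_k \sum_{i=1}^{N} \frac{f_k(x_i) f_l(x_i)}{\sigma_i^2}, \quad l = 1, \ldots, m.
\tag{8.13}
$$

Equations (8.13) are m coupled equations in the parameters $a_k$, which can be
solved using matrix algebra, as described below. Notice that the term $f_l(x_i)$ is
the l-th model component (thus the index l is not summed over), and the index
i runs from 1 to N, where N is the number of data points.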

The best-fit parameters are therefore obtained by defining the row vectors $\beta$ and
$a$ and the $m \times m$ symmetric matrix $A$ as

$$
\begin{cases}
\beta = (\beta_1, \ldots, \beta_m) & \text{in which } \beta_k = \sum_{i=1}^{N} f_k(x_i)\, y_i / \sigma_i^2 \\[1ex]
a = (a_1, \ldots, a_m) & \text{(model parameters)} \\[1ex]
A_{lk} = \sum_{i=1}^{N} \dfrac{f_l(x_i) f_k(x_i)}{\sigma_i^2} & (l, k \text{ component of the } m \times m \text{ matrix } A).
\end{cases}
$$

With these definitions, (8.13) can be rewritten in matrix form as

$$
\beta = aA, \tag{8.14}
$$

and therefore the task of estimating the best-fit parameters is that of inverting the
matrix $A$, which can be done numerically. The m best-fit parameters are placed
in a row vector $a$ (of dimensions $1 \times m$) and are given by

$$
a = \beta A^{-1}. \tag{8.15}
$$

The $1 \times m$ row vector $\beta$ and the $m \times m$ matrix $A$ can be calculated from the data and
the fit functions $f_k(x)$.
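
The construction of $\beta$ and $A$ and the solution of (8.14) can be sketched in a few lines of Python. This is a hedged illustration using only the standard library: the function names `solve_linear_system` and `multiple_regression`, the choice of Gauss-Jordan elimination in place of an explicit matrix inverse, and the synthetic quadratic dataset are all assumptions of this example. Because $A$ is symmetric, solving $A z = \beta$ for the column vector $z$ gives the same numbers as $a = \beta A^{-1}$.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * z = rhs by Gauss-Jordan elimination with partial pivoting."""
    m = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(m)]  # augmented matrix
    for col in range(m):
        # Bring the row with the largest pivot to the current position.
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][m] for r in range(m)]

def multiple_regression(x, y, sigma, basis):
    """Best-fit parameters a_k for y(x) = sum_k a_k f_k(x), following (8.13)-(8.15)."""
    w = [1.0 / s ** 2 for s in sigma]
    # beta_k = sum_i f_k(x_i) y_i / sigma_i^2
    beta = [sum(wi * f(xi) * yi for wi, xi, yi in zip(w, x, y)) for f in basis]
    # A_lk = sum_i f_l(x_i) f_k(x_i) / sigma_i^2
    A = [[sum(wi * fl(xi) * fk(xi) for wi, xi in zip(w, x)) for fk in basis]
         for fl in basis]
    return solve_linear_system(A, beta)

# Synthetic, noiseless data from y = 1 + 2x + 0.5x^2 (assumed for illustration).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * t + 0.5 * t * t for t in x]
sigma = [0.1, 0.2, 0.1, 0.3, 0.2]
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]

params = multiple_regression(x, y, sigma, basis)
print(params)
```

For a small number of parameters this direct solution is adequate; for larger or poorly conditioned problems a dedicated linear-algebra library would normally be preferred.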


8.4.2  Parameter  Errors  and  Covariances  for  Multiple 
Regression 


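To calculate errors in the best-fit parameters, we treat parameters $a_k$ as functions of
the measurements, $a_k = a_k(y_i)$. Therefore we can use the error propagation method
to calculate variances and covariances between parameters as:

$$
\begin{cases}
\sigma_{a_l}^2 = \sum_{i=1}^{N} \left( \dfrac{\partial a_l}{\partial y_i} \right)^2 \sigma_i^2 \\[2ex]
\sigma_{a_l a_j}^2 = \sum_{i=1}^{N} \dfrac{\partial a_l}{\partial y_i} \dfrac{\partial a_j}{\partial y_i} \, \sigma_i^2.
\end{cases}
\tag{8.16}
$$

We have used the fact that the error in each measurement $y_i$ is given by $\sigma_i$ and that
the measurements are independent.

We show that the variance $\sigma_{a_l a_j}^2$ is given by the l, j term of the inverse of the
matrix $A$, which we define as the error matrix

$$
\varepsilon = A^{-1}. \tag{8.17}
$$

The error matrix $\varepsilon$ is a symmetric matrix, of which the diagonal terms contain the
variances of the fitted parameters and the off-diagonal terms contain the covariances.

Proof Use the matrix equation $a = \beta\varepsilon$ to write

$$
a_l = \sum_{k=1}^{m} \beta_k \varepsilon_{kl} = \sum_{k=1}^{m} \varepsilon_{kl} \sum_{i=1}^{N} \frac{f_k(x_i)\, y_i}{\sigma_i^2}
\quad \Rightarrow \quad
\frac{\partial a_l}{\partial y_i} = \sum_{k=1}^{m} \varepsilon_{kl} \frac{f_k(x_i)}{\sigma_i^2}.
$$

The equation above can be used into (8.16), together with the symmetry of $\varepsilon$, to show that

$$
\sigma_{a_l a_j}^2 = \sum_{i=1}^{N} \sigma_i^2 \left( \sum_{k=1}^{m} \varepsilon_{jk} \frac{f_k(x_i)}{\sigma_i^2} \right) \left( \sum_{p=1}^{m} \varepsilon_{lp} \frac{f_p(x_i)}{\sigma_i^2} \right)
$$

in which the indices k and p indicate the m model parameters, and the index i
is used for the sum over the N measurements. Rearranging the sums,

$$
\sigma_{a_l a_j}^2 = \sum_{k=1}^{m} \sum_{p=1}^{m} \varepsilon_{jk} \varepsilon_{lp} \sum_{i=1}^{N} \frac{f_k(x_i) f_p(x_i)}{\sigma_i^2} = \sum_{k=1}^{m} \sum_{p=1}^{m} \varepsilon_{jk} \varepsilon_{lp} A_{pk}.
$$

Now recall that $A$ is the inverse of $\varepsilon$, and therefore the expression above can
be simplified to

$$
\sigma_{a_l a_j}^2 = \varepsilon_{jl}. \tag{8.18}
$$

□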


8.4.3 Errors and Covariance for Linear Regression

The results of Sect. 8.4.2 apply also to the case of linear regression as a special
case. We therefore use these results to estimate the errors in the linear regression
parameters a and b and their covariance. In this case, the functions $f_l(x_i)$ are given,
respectively, by $f_1(x) = 1$ and $f_2(x) = x$ and therefore the matrix $A$ is a $2 \times 2$
symmetric matrix with the following elements:

$$
\begin{cases}
A_{11} = \sum_{i=1}^{N} 1/\sigma_i^2 \\[1ex]
A_{12} = A_{21} = \sum_{i=1}^{N} x_i/\sigma_i^2 \\[1ex]
A_{22} = \sum_{i=1}^{N} x_i^2/\sigma_i^2.
\end{cases}
\tag{8.19}
$$

The inverse matrix $A^{-1} = \varepsilon$ is given by

$$
\begin{cases}
\varepsilon_{11} = A_{22}/\Delta \\
\varepsilon_{12} = \varepsilon_{21} = -A_{12}/\Delta \\
\varepsilon_{22} = A_{11}/\Delta
\end{cases}
\tag{8.20}
$$

in which $\Delta$ is the determinant of $A$. Using (8.14) we calculate $\beta$:

$$
\begin{cases}
\beta_1 = \sum y_i/\sigma_i^2 \\
\beta_2 = \sum y_i x_i/\sigma_i^2
\end{cases}
\tag{8.21}
$$

and thus proceed to calculating the best-fit parameters and their errors. The best-fit
parameters, already found in Sect. 8.3, are given by

$$
(a, b) = (\beta_1, \beta_2)
\begin{pmatrix}
\varepsilon_{11} & \varepsilon_{12} \\
\varepsilon_{21} & \varepsilon_{22}
\end{pmatrix}
$$

which give the same results as previously found in (8.7). We are now in a position
to estimate the errors in the best-fit parameters:

$$
\begin{cases}
\sigma_a^2 = \varepsilon_{11} = \dfrac{1}{\Delta} \sum_{i=1}^{N} x_i^2/\sigma_i^2 \\[2ex]
\sigma_b^2 = \varepsilon_{22} = \dfrac{1}{\Delta} \sum_{i=1}^{N} 1/\sigma_i^2 \\[2ex]
\sigma_{ab}^2 = \varepsilon_{12} = -\dfrac{1}{\Delta} \sum_{i=1}^{N} x_i/\sigma_i^2.
\end{cases}
\tag{8.22}
$$

The importance of (8.22) is that the errors in the parameters a and b and their
covariance can be computed analytically from the N measurements. This simple
solution makes the linear regression very simple to implement.
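
As an illustration of (8.22), the variances and the covariance of the two parameters can be computed from the positions $x_i$ and the errors $\sigma_i$ alone, since the measured values $y_i$ do not enter. The function name `linear_fit_errors` and the three-point dataset below are assumptions chosen so that the sums can be checked by hand.

```python
import math

def linear_fit_errors(x, sigma):
    """Variances of a and b and their covariance for y = a + b x, following (8.22)."""
    w = [1.0 / s ** 2 for s in sigma]
    a11 = sum(w)                                        # sum of 1/sigma_i^2
    a12 = sum(wi * xi for wi, xi in zip(w, x))          # sum of x_i/sigma_i^2
    a22 = sum(wi * xi * xi for wi, xi in zip(w, x))     # sum of x_i^2/sigma_i^2
    delta = a11 * a22 - a12 ** 2                        # determinant of A
    var_a = a22 / delta
    var_b = a11 / delta
    cov_ab = -a12 / delta
    return var_a, var_b, cov_ab

# Three points with unit errors (assumed for illustration):
# the sums are 3, 3 and 5, so the determinant is 3*5 - 3^2 = 6.
x = [0.0, 1.0, 2.0]
sigma = [1.0, 1.0, 1.0]

var_a, var_b, cov_ab = linear_fit_errors(x, sigma)
correlation = cov_ab / math.sqrt(var_a * var_b)
print(var_a, var_b, cov_ab, correlation)
```

With positive values of $x_i$ the covariance is negative: an overestimate of the slope is compensated by an underestimate of the intercept.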


8.5  Special  Cases:  Identical  Errors  or  No  Errors  Available 

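It is common to have a dataset where all measurements have the same error. When
all errors in the dependent variable are identical ($\sigma_i = \sigma$), (8.7) and (8.22) for the
linear regression are simplified to

$$
\begin{cases}
a = \dfrac{1}{\Delta} \dfrac{1}{\sigma^4} \left( \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i^2 - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} x_i y_i \right) \\[2ex]
b = \dfrac{1}{\Delta} \dfrac{1}{\sigma^4} \left( N \sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i \right) \\[2ex]
\Delta = \dfrac{1}{\sigma^4} \left( N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2 \right)
\end{cases}
\tag{8.23}
$$

and

$$
\begin{cases}
\sigma_a^2 = \dfrac{1}{\Delta} \dfrac{1}{\sigma^2} \sum_{i=1}^{N} x_i^2 \\[2ex]
\sigma_b^2 = \dfrac{1}{\Delta} \dfrac{N}{\sigma^2} \\[2ex]
\sigma_{ab}^2 = -\dfrac{1}{\Delta} \dfrac{1}{\sigma^2} \sum_{i=1}^{N} x_i.
\end{cases}
\tag{8.24}
$$

The important feature is that the best-fit parameters are independent of the value $\sigma$
of the error.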

For datasets that do not have errors available it is often reasonable to assume that
all datapoints have the same error and calculate the best-fit parameters without the
need to specify the value of $\sigma$. The variances, which depend on the error, cannot
however be estimated. The absence of errors therefore limits the applicability of
the linear regression method. It is in general not possible to reconstruct the errors
$\sigma_i$ a posteriori. In fact, the errors are the result of the experimental procedure that
led to the measurement of the variables. A typical example is the case in which
each of the variables $y(x_i)$ was measured via repeated experiments which led to
the measurement of $y(x_i)$ as the mean of the measurements and its error as the
square root of the sample variance. In the absence of the “raw” data that permit
the calculation of the sample variance, it is simply not possible to determine the
error $\sigma_i$.

Another possibility to use a dataset that does not report the errors in the
measurements is based on the assumption that the fitting function $y = f(x)$ is the
correct description for the data. Under this assumption, one can estimate the errors,
assumed to be identical for all variables in the dataset, via a model sample variance
defined as

$$
\sigma^2 = \frac{1}{N - m} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\tag{8.25}
$$

where $\hat{y}_i$ is the value of the fitting function $f(x_i)$ evaluated with the best-fit
parameters, which must be first obtained by a fit assuming identical errors. The
underlying assumption behind the use of (8.25) is to treat each measurement $y_i$ as
drawn from a parent distribution $f(x_i)$, $i = 1, \ldots, N$, i.e., assuming that the model
is the correct description for the data. In the case of a linear regression, $m = 2$,
since two parameters (a and b) are estimated from the data. It will become clear
in Sect. 10.1 that this procedure comes at the expense of the ability to determine
whether the dataset is in fact well fit by the function $y = f(x)$, since that is the
working assumption.
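
The two-step procedure described above (a fit with identical errors, followed by the estimate of the common error from the residuals) can be sketched as follows. The function names `equal_error_fit` and `model_sample_sigma` and the four data points are assumptions introduced for this illustration.

```python
import math

def equal_error_fit(x, y):
    """Best-fit a and b of y = a + b x when all errors are identical, using (8.9)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b

def model_sample_sigma(x, y, a, b, m=2):
    """Square root of the model sample variance (8.25) for a linear model;
    m is the number of fitted parameters."""
    n = len(x)
    sum_sq = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sum_sq / (n - m))

# Made-up measurements without reported errors (assumed for illustration).
x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 0.9, 2.1, 2.9]

a, b = equal_error_fit(x, y)
sigma_est = model_sample_sigma(x, y, a, b)
print(a, b, sigma_est)
```

The estimated $\sigma$ can then be used in (8.24) to obtain the errors in a and b, keeping in mind that the goodness of fit can no longer be tested independently.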

In the case of no errors reported, it may not be clear which variable is to be
treated as independent. We have shown in (8.9) that, when no errors are reported,
the best-fit parameters can be written as

$$
\begin{cases}
b = \dfrac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} \\[2ex]
a = E(Y) - b\,E(X).
\end{cases}
$$

This equation clearly shows that the best-fit linear regression model is dependent on
the choice of which between x and y is considered the independent variable. In fact,
if y is regarded as the independent variable, and the data fit to the model

$$
x = a' + b' y
\tag{8.26}
$$

the least-squares method gives the best-fit slope of

$$
b' = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}.
$$

When the model is rewritten in the usual form

$$
y = a_{X/Y} + b_{X/Y}\, x
$$

in which the notation X/Y means “X given Y,” the best-fit model parameters are

$$
\begin{cases}
b_{X/Y} = \dfrac{1}{b'} = \dfrac{\mathrm{Var}(Y)}{\mathrm{Cov}(X,Y)} \\[2ex]
a_{X/Y} = E(Y) - b_{X/Y}\, E(X)
\end{cases}
$$

and therefore the two linear models assuming x or y as independent variable will
be different from one another. It is up to the data analyst to determine which of
the two variables is to be considered as independent when there is a dataset of
$(x_i, y_i)$ measurements with no errors reported in either variable. Normally the issue is
resolved by knowing how the experiment was performed, e.g., which variable had to
be assumed or calculated first in order to calculate or measure the second. Additional
considerations for the fit of two-variable datasets are presented in Chap. 12.
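
The dependence of the best-fit line on the choice of independent variable is easy to verify numerically. In this sketch the function name `regression_slopes` and the four data points are illustrative assumptions; both slopes are expressed in the usual form, as the slope of y versus x.

```python
def regression_slopes(x, y):
    """Slopes of y versus x from the two choices of independent variable:
    b (x independent, Cov/Var(X)) and b_{X/Y} (y independent, Var(Y)/Cov)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    slope_x_independent = cov_xy / var_x
    slope_y_independent = var_y / cov_xy
    return slope_x_independent, slope_y_independent

# Made-up measurements without reported errors (assumed for illustration).
x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 0.9, 2.1, 2.9]

b_x_indep, b_y_indep = regression_slopes(x, y)
print(b_x_indep, b_y_indep)
```

The two slopes coincide only when the points lie exactly on a line; their ratio is the square of the linear correlation coefficient, so the difference grows as the scatter increases.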


8.6  A  Classic  Experiment:  Edwin  Hubble’s  Discovery 
of  the  Expansion  of  the  Universe 

In the early twentieth century astronomers were debating whether “nebulae,”
now known to be external galaxies, were in fact part of our own Galaxy, and
there was no notion of the Big Bang and the expansion of the universe. Edwin
Hubble pioneered the revolution via a seemingly simple observation that a
number of “nebulae” moved away from the Earth with a velocity v that is
proportional to their distance d, known as Hubble’s law

$$
v = H_0 d. \tag{8.27}
$$

The quantity $H_0$ is the Hubble constant, typically measured in the units of
km s$^{-1}$ Mpc$^{-1}$, where Mpc indicates a distance of $10^6$ parsec. The data used
by Hubble [21] are summarized in Table 8.1.

The quantity m is the apparent magnitude, related to the distance via the
following relationship,

$$
\log d = \frac{m - M + 5}{5}
\tag{8.28}
$$

where $M = -13.8$ is the absolute magnitude, also measured by Hubble as
part of the same experiment, and considered as a constant for the purpose of
this dataset, and d is measured in parsecs.

The first part of the experiment consisted in fitting the (v, m) dataset to a
relationship that is linear in log v,

$$
\log v = a + b \cdot m \tag{8.29}
$$

where a and b are the adjustable parameters of the linear regression. Instead
of performing the linear regression described in Sects. 8.3 and 8.4.3, Hubble
reported two different fit results, one in which he determined also the error
in the slope b,

$$
\log v = (0.202 \pm 0.007) \cdot m + 0.472 \tag{8.30}
$$

and one in which he fixed $b = 0.2$, and determined the error in the intercept a:

$$
\log v = 0.2 \cdot m + 0.507 \pm 0.012. \tag{8.31}
$$

Using (8.31) into (8.28), Hubble determined the following relationship
between velocity and distance,

$$
\log \frac{v}{d} = 0.2 M - 0.493 = -3.253 \tag{8.32}
$$

and this results in the measurement of his name-sake constant, $H_0 = v/d =
10^{-3.253} = 558 \times 10^{-6}$ km s$^{-1}$ pc$^{-1}$, or 558 km s$^{-1}$ Mpc$^{-1}$.
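
The arithmetic that leads from (8.31) and (8.28) to (8.32) can be retraced step by step. The short sketch below is only a check of the numbers quoted in the text; the function name `hubble_constant_from_fit` is an assumption of this example, and the derivation in the comments relies on the slope being fixed at 0.2 so that the coefficient of log d is exactly one.

```python
def hubble_constant_from_fit(intercept, abs_magnitude, slope=0.2):
    """Return log10(v/d) and H0 in km/s/Mpc from the fit log v = slope*m + intercept.

    From (8.28), m = 5 log d + M - 5. Substituting into the fit with slope 0.2:
        log v = log d + 0.2*(M - 5) + intercept
    so that log(v/d) = 0.2*M - 1 + intercept, with d in parsecs.
    """
    log_v_over_d = slope * (abs_magnitude - 5.0) + intercept
    h0_per_pc = 10.0 ** log_v_over_d      # km/s per parsec
    h0_per_mpc = h0_per_pc * 1.0e6        # km/s per megaparsec
    return log_v_over_d, h0_per_mpc

log_ratio, h0 = hubble_constant_from_fit(intercept=0.507, abs_magnitude=-13.8)
print(log_ratio, h0)
```

The constant term $-0.493$ in (8.32) is therefore $0.507 - 1$, and the result reproduces the value of approximately 558 km s$^{-1}$ Mpc$^{-1}$ quoted above.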


Table 8.1 Data from E. Hubble’s measurements

Name of nebula    Mean velocity (km s$^{-1}$)    Number of velocities    Mean m
Virgo                      890                          7                 12.5
Pegasus                   3810                          5                 15.5
Pisces                    4630                          4                 15.4
Cancer                    4820                          2                 16.0
Perseus                   5230                          4                 16.4
Coma                      7500                          3                 17.0
Ursa Major              11,800                          1                 18.0
Leo                     19,600                          1                 19.0
(No name)                 2350                         16                 13.8
(No name)                  630                         21                 11.6

Fig. 8.2 Best-fit linear regression model for the data in Table 8.1

Example 8.1 The data from Hubble’s experiment are a typical example of a dataset
in which no errors were reported. A linear fit can be initially performed by assuming
equal errors, and the best-fit line is reported in red in Fig. 8.2. Using (8.25), the
common errors in the dependent variables $\log v(x_i)$ are found to be $\sigma = 0.06$, the
best-fit parameters of the models are $a = 0.55 \pm 0.13$, $b = 0.197 \pm 0.0085$, and
the covariance is $\sigma_{ab}^2 = -1.12 \times 10^{-3}$, for a correlation coefficient of $-0.99$. The
uncertainties and the covariance are measured using the method of (8.23). The
best-fit line is shown in Fig. 8.2 as a solid line. ♦
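
The numbers of Example 8.1 can be approximately reproduced with a short script. This is a sketch under the assumptions stated in the example (equal errors, m as the independent variable, log v as the dependent variable); the variable names are illustrative, and the data are those of Table 8.1.

```python
import math

# Data from Table 8.1: mean apparent magnitude and mean velocity (km/s).
magnitude = [12.5, 15.5, 15.4, 16.0, 16.4, 17.0, 18.0, 19.0, 13.8, 11.6]
velocity = [890, 3810, 4630, 4820, 5230, 7500, 11800, 19600, 2350, 630]
log_v = [math.log10(v) for v in velocity]

n = len(magnitude)
sx = sum(magnitude)
sy = sum(log_v)
sxx = sum(m * m for m in magnitude)
sxy = sum(m * lv for m, lv in zip(magnitude, log_v))

# Best-fit parameters for identical errors; d equals sigma^4 times Delta.
d = n * sxx - sx ** 2
b = (n * sxy - sx * sy) / d
a = (sy * sxx - sx * sxy) / d

# Common error from the model sample variance, with m = 2 fitted parameters.
residuals = [lv - (a + b * m) for m, lv in zip(magnitude, log_v)]
sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))

# Parameter errors and covariance for identical errors.
sigma_a = sigma * math.sqrt(sxx / d)
sigma_b = sigma * math.sqrt(n / d)
cov_ab = -sigma ** 2 * sx / d
correlation = cov_ab / (sigma_a * sigma_b)

print(a, b, sigma, sigma_a, sigma_b, cov_ab, correlation)
```

The values printed by this script should land close to those quoted in the example; small differences in the last digit can arise from rounding.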
8.7 Maximum Likelihood Method for Non-linear Functions

The method described in Sect. 8.4 assumes that the model is linear in the fitting
parameters $a_k$. This requirement is, however, not necessary to apply the maximum
likelihood criterion. We can assume that the relationship $y = f(x)$ has any analytic
form and still apply the maximum likelihood criterion for the N measurements [see
(8.3)]. The best-fit parameters are still those that minimize the $\chi^2$ statistic. In fact, all
considerations leading to (8.3) do not require a specific form for the fitting function
$y = f(x)$. The assumption that must still be satisfied is that each variable $y_i$ is
Gaussian distributed, in order to obtain the likelihood in the form of (8.2).

The only complication for nonlinear functions is that an analytic solution for
the best-fit values and the errors is in general no longer available. This is often not
a real limitation, since numerical methods to minimize the $\chi^2$ are available. The
most straightforward way to achieve a minimization of the $\chi^2$ as a function of all
parameters is to construct an m-dimensional grid of all possible parameter values,
evaluate the $\chi^2$ at each point, and then find the global minimum. The parameter
values corresponding to this minimum can be regarded as the best estimate of the
model parameters. The direct grid-search method becomes rapidly unfeasible as the
number of free parameters increases. In fact, the full grid consists of $n^m$ points,
where n is the number of discrete points at which each parameter is investigated.
One typically wants a large value of n, so that parameter space is investigated with
the necessary resolution, and the time to evaluate the entire space depends on how
efficiently a calculation of the likelihood can be obtained. Among the methods that
can be used to bypass the calculation of the entire grid, one of the most efficient and
popular is the Markov chain Monte Carlo technique, which is discussed in detail in
Chap. 16.

To find the uncertainties in the parameters using the grid search method requires
a knowledge of the expected variation of the $\chi^2$ around the minimum. This problem
will be explained in the next chapter. The Markov chain Monte Carlo technique also
provides estimates of the parameter errors and their covariance.
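
The direct grid search described in this section can be sketched in Python for a two-parameter nonlinear model. Everything specific in this example is an assumption made for illustration: the exponential model, the function names `chi_square` and `grid_search`, the noiseless synthetic data, and the grid ranges and spacing.

```python
import math
from itertools import product

def chi_square(params, x, y, sigma, model):
    """Chi-square statistic of (8.3) for a model called as model(x, *params)."""
    return sum(((yi - model(xi, *params)) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))

def grid_search(x, y, sigma, model, axes):
    """Evaluate chi-square on the full grid (the Cartesian product of the axes)
    and return the parameters of the global minimum together with its value."""
    best_params, best_chi2 = None, math.inf
    for params in product(*axes):
        value = chi_square(params, x, y, sigma, model)
        if value < best_chi2:
            best_params, best_chi2 = params, value
    return best_params, best_chi2

def exponential(x, amplitude, rate):
    """Nonlinear trial model y = amplitude * exp(-rate * x)."""
    return amplitude * math.exp(-rate * x)

# Noiseless synthetic data generated with amplitude 10 and rate 0.5.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [exponential(xi, 10.0, 0.5) for xi in x]
sigma = [0.2] * len(x)

# n = 41 points per parameter, for a full grid of n^m = 41^2 = 1681 evaluations.
amplitude_axis = [8.0 + 0.1 * i for i in range(41)]
rate_axis = [0.30 + 0.01 * i for i in range(41)]

best, chi2_min = grid_search(x, y, sigma, exponential, [amplitude_axis, rate_axis])
print(best, chi2_min)
```

With two parameters the grid is small, but the number of evaluations grows as $n^m$, which is why the method becomes impractical for models with many free parameters.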


8.8  Linear  Regression  with  Poisson  Data 

The two main assumptions made so far in the maximum likelihood method are
that the random variables $y(x_i)$ are Gaussian and that the variances of these variables are
estimated from the data as the measured variances $\sigma_i^2$. In the following we discuss
how the maximum likelihood method can be applied to data without making the
assumption of a Gaussian distribution. One case of great practical interest is when
variables have a Poisson distribution, which is the case in many counting experiments.
For simplicity we focus on the case of linear regression, although all considerations
can be extended to any type of fitting function.

When $y(x_i)$ is assumed to be Poisson distributed, the dataset takes the form of
$(x_i, y_i)$, in which the values $y_i$ are intended as integers resulting from a counting
experiment. In this case, the value $y(x_i) = a + b x_i$ is considered as the parent mean
for a given choice of parameters a and b,

$$
\mu_i = y(x_i) = a + b x_i. \tag{8.33}
$$

The likelihood is calculated using the Poisson distribution and, under the
hypothesis of independent measurements, it is

$$
\mathcal{L} = \prod_{i=1}^{N} \frac{y(x_i)^{y_i} \, e^{-y(x_i)}}{y_i!}.
\tag{8.34}
$$

Once we remove the Gaussian assumption, there is no $\chi^2$ function to minimize, but
the whole likelihood must be taken into account. It is convenient to maximize the
logarithm of the likelihood,

$$
\ln \mathcal{L} = \sum_{i=1}^{N} y_i \ln y(x_i) - \sum_{i=1}^{N} y(x_i) + A
\tag{8.35}
$$

where $A = -\sum \ln y_i!$ does not depend on the model parameters but only on the
fixed values of the datapoints. Maximization of the logarithm of the likelihood is
equivalent to a maximization of the likelihood, since the logarithm is a monotonic
function of its argument. The principle of maximum likelihood requires that

$$
\begin{cases}
\dfrac{\partial}{\partial a} \ln \mathcal{L} = 0 \\[2ex]
\dfrac{\partial}{\partial b} \ln \mathcal{L} = 0
\end{cases}
\Rightarrow
\begin{cases}
N = \sum \dfrac{y_i}{a + b x_i} \\[2ex]
\sum x_i = \sum \dfrac{x_i y_i}{a + b x_i}.
\end{cases}
\tag{8.36}
$$

The fact that the optimization is carried out on $\ln \mathcal{L}$ instead of $\chi^2$ is
a significant difference relative to the case of Gaussian data. For Poisson data we
define the fit statistic C as

$$
C = -2 \ln \mathcal{L} + B, \tag{8.37}
$$

where B is a constant term. This is called the Cash statistic, after a paper by Cash in
1979 [9]. This statistic will be discussed in detail in Sect. 10.2 and it will be shown
to have the property of being distributed like a $\chi^2$ distribution with $N - m$ degrees
of freedom in the limit of large N. This result is extremely important, as it allows
one to proceed with the Poisson fitting in exactly the same way as in the more common
Gaussian case in order to determine the goodness of fit.

There are many cases in which a Poisson dataset can be approximated with a
Gaussian dataset, so that $\chi^2$ can be used as the fit statistic. When the number of counts
in each measurement $y_i$ is larger than approximately 10 (see Sect. 3.4), the
Poisson distribution is accurately described by a Gaussian of the same mean and
variance. When the number of counts is lower, one method to turn a Poisson dataset
into a Gaussian one is to bin the data into fewer variables with larger counts. There
are, however, many situations in which such binning is not desirable, especially
when the dependent variable $y$ has particular behaviors for certain values of the
independent variable $x$. In those cases, binning of the data smears those features,
which we would like to retain in the dataset. The best option is then to
use the Poisson fitting method described in this section, with $C$ as the fit statistic.
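The quality of the Gaussian approximation mentioned above can be explored numerically. The following sketch compares the Poisson probability of each count with a Gaussian density of the same mean and variance, and reports the largest gap relative to the Poisson peak. The function names and the choice of means (3, 10, 30, 100) are illustrative assumptions, not taken from the text.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of observing k counts for a mean mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def gaussian_pdf(x, mu):
    """Gaussian density with mean mu and variance mu (same as the Poisson)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * mu)) / math.sqrt(2.0 * math.pi * mu)


def max_relative_gap(mu):
    """Largest |Poisson - Gaussian| over k = 0..3*mu, relative to the Poisson peak."""
    ks = range(0, int(3 * mu) + 1)
    peak = max(poisson_pmf(k, mu) for k in ks)
    gap = max(abs(poisson_pmf(k, mu) - gaussian_pdf(k, mu)) for k in ks)
    return gap / peak


# Hypothetical set of means used only to illustrate the trend.
gaps = {mu: max_relative_gap(mu) for mu in (3, 10, 30, 100)}
for mu, gap in gaps.items():
    print(f"mean = {mu:3d}   largest relative gap = {gap:.3f}")
```

The gap shrinks as the mean grows, which is the reason why binning low-count data into bins with more counts makes the $\chi^2$ statistic applicable.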

Example 8.2 Consider a set of N = 4 measurements (3, 5, 4, 2) to be fit to a constant
model, y = a. In this case, (8.36) becomes

$$
a = \frac{1}{N} \sum_{i=1}^{N} y_i,
$$

which means that the maximum likelihood estimator of a constant model, for a
Poisson dataset, is the average of the measurements. The maximum likelihood
best-fit parameter is therefore a = 3.5. ◊
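The result of Example 8.2 can be checked numerically. The sketch below evaluates the Poisson log-likelihood and the statistic $C = -2 \ln \mathcal{L} + B$ for the four measurements of the example; the choice B = 0 and the function names are assumptions made only for this illustration.

```python
import math


def poisson_log_likelihood(counts, mu):
    """ln L for independent Poisson counts sharing the same mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def cash_statistic(counts, mu, B=0.0):
    """C = -2 ln L + B; the constant B is set to zero here by assumption."""
    return -2.0 * poisson_log_likelihood(counts, mu) + B


counts = [3, 5, 4, 2]                    # data of Example 8.2
a_hat = sum(counts) / len(counts)        # analytic ML estimate: the average

# Scan a few trial values of the constant model around the average.
scan = [(a, cash_statistic(counts, a)) for a in (3.0, 3.25, 3.5, 3.75, 4.0)]
a_min, c_min = min(scan, key=lambda pair: pair[1])
print(f"average = {a_hat}, minimum of C on the grid at a = {a_min}")
```

The smallest value of C on the grid occurs at the average of the measurements, as expected from the analytic solution.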


Summary  of  Key  Concepts  for  this  Chapter 

□ ML fit to two-dimensional data: A method to find best-fit parameters of
a model fit to x, y data assuming that one variable (typically x) is the
independent variable.

□ Linear regression: ML fit to a linear model; the best-fit parameters when all
errors are identical are

$$
\begin{cases}
b = \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} \\[2ex]
a = E[Y] - b\,E[X]
\end{cases}
$$

(assuming x as the independent variable).

□ Multiple linear regression: An extension of the linear regression to models
of the type

$$
y(x) = \sum_{k=1}^{m} a_k f_k(x).
$$

□ Model sample variance: When errors in the dependent variable (y) are not
known, they can be estimated via the model sample variance

$$
s^2 = \frac{1}{N - m} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,
$$

where $\hat{y}_i$ is the best-fit model evaluated at $x_i$ and m is the number of model parameters.
Problems

8.1 Consider the data from Hubble's experiment in Table 8.1.

(a) Determine the best-fit values of the fit to a linear model for (m, log v) assuming
that the dependent variables have a common value for the error.

(b) Using the best-fit model determined above, estimate the error from the data and
the best-fit model, and then estimate the errors in the parameters a and b, and
the correlation coefficient between a and b.

(c) Calculate the minimum $\chi^2$ of the linear fit, using the common error as estimated
in part (b).

8.2 Consider the following two-dimensional data, in which X is the independent
variable, and Y is the dependent variable assumed to be derived from a
photon-counting experiment:

    x_i    y_i
    0.0    25
    1.0    36
    2.0    47
    3.0    64
    4.0    81

(a) Determine the errors associated with the dependent variables $y_i$.

(b) Find the best-fit parameters a, b of the linear regression curve

y(x) = a + bx;

also compute the errors in the best-fit parameters and the correlation coefficient
between them.

(c) Calculate the minimum $\chi^2$ of the fit, and the corresponding probability to
exceed this value.

8.3 Consider the following Gaussian dataset in which the dependent variables are
assumed to have the same unknown standard deviation $\sigma$,

    x_i    y_i
    0.0    0.0
    1.0    1.5
    2.0    1.5
    3.0    2.5
    4.0    4.5
    5.0    5.0

The data are to be fit to a linear model.

(a) Using the maximum likelihood method, find the analytic relationships between
the sums $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$, and the model parameters a and b.

(b) Show that the best-fit values of the model parameters are a = 0 and b = 1.

8.4 In the case of a maximum likelihood fit to a 2-dimensional dataset with
equal errors in the dependent variable, show that the conditions for having best-fit
parameters a = 0 and b = 1 are

$$
\begin{cases}
\displaystyle\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} x_i \\[2ex]
\displaystyle\sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i y_i.
\end{cases}
\tag{8.38}
$$


8.5 Show that the best-fit parameter b of a linear fit to a Gaussian dataset is
insensitive to a change of all datapoints by the same amount $\Delta x$, or by the same
amount $\Delta y$. You can show that this property applies in the case of equal errors in
the dependent variable, although the same result applies also for the case of different
errors.

8.6 The background rate in a measuring apparatus is assumed to be constant with
time. N measurements of the background are taken, of which N/2 result in a value
of $y + \Delta$, and N/2 in a value of $y - \Delta$. Determine the sample variance of the background
rate.

8.7 Find an analytic solution for the best-fit parameters of a linear model to the
following Poisson dataset:

     x     y
    -2    -1
    -1     0
     0     1
     1     0
     2     2

8.8 Use the data provided in Table 6.1 to calculate the best-fit parameters a and b
for the fit to the radius vs. pressure ratio data, and the minimum $\chi^2$. For the fit, you
can assume that the radius is known exactly, and that the standard deviation of the
pressure ratio is obtained as a linear average of the positive and negative errors.

8.9 Show that, when all measurement errors are identical, the least squares
estimators of the linear parameters a and b are given by b = Cov(X, Y)/Var(X)
and a = E(Y) - bE(X).


Chapter  9 

Multi-Variable Regression


Abstract In many situations a variable of interest depends on several other
variables. Such multi-variable data is common across the sciences and in many other
fields such as economics and business. Multi-variable analysis can be performed in
a simple and effective way when the relationship that links the variable of interest to
the other quantities is linear. In this chapter we study the method of multi-variable
regression and show how it is related to the multiple regression described in Chap. 8,
which applies to the traditional two-variable dataset. This chapter also presents
methods for hypothesis testing on the multi-variable regression and its parameters.


9.1 Multi-Variable Datasets

The two-dimensional datasets studied so far include an independent variable (X) and a
dependent variable (Y), and the data take the form of a collection of $(x_i, y_i \pm \sigma_i)$,
where $i = 1, \dots, N$ and N indicates the total number of measurements. In Chap. 8
we developed a method to fit such two-dimensional data. In that case, the linear
regression formula takes the form $y(x) = a + bx$, where a and b are the parameters
of the linear regression.

Datasets that have measurements for three or more variables are referred to
as multi-variable datasets. An example of a multi-variable dataset is presented in
Sect. 9.2, which reports measurements of different characteristics of irises
performed by Fisher and Anderson in 1936 [14]. Each of those measurements comprises
four quantities: the sepal length, sepal width, petal length, and petal width of 50
irises. For several multi-variable datasets such as that of Fisher and Anderson it
is often unclear which variable is the dependent one. It typically depends on what
question we want to address with the data: if we want to determine the sepal length
of an iris flower based on the sepal width, petal length, and petal width, then
the sepal length becomes the dependent variable and the remaining three are the
independent variables.

Using multi-variable datasets to predict or forecast the behavior of one quantity
based on several other variables is a fundamental topic in data analysis. It is
common throughout the sciences and especially used in such fields as economics
or behavioral sciences, where a number of possible factors can be used to predict
one quantity of interest. An example is to predict the score on a college-admission
test based on factors such as the grade-point average during the sophomore and the
junior year, a measure of the motivation of the student, and their economic status.
Another example is to predict the price of a stock based, e.g., on the overall index of
the stock exchange, a consumer's index for goods in the relevant class, and the rate
of treasury bonds. Addressing any such question clearly requires a multi-variable
dataset that has several measurements for all quantities of interest.

In this chapter we develop a method to determine the relationship between one of
the quantities of a multi-dimensional dataset and the others, assuming a linear
relationship among the variables. This method will also let us study whether one
or more of the quantities are in fact not useful in predicting the variable of interest.
For example, we may find that the treasury bond rates are irrelevant in predicting
the stock value of a given corporation and therefore we can focus only on those
variables that are useful in predicting its stock price.


9.2  A  Classic  Experiment:  The  R.A.  Fisher  and 
E.  Anderson  Measurements  of  Iris  Characteristics 

R.A.  Fisher  is  one  of  the  fathers  of  modern  statistics.  In  1936  he  published  the 
paper  The  Use  of  Multiple  Measurements  in  Taxonomic  Problems  reporting 
measurements  of  several  characteristics  of  three  species  of  the  iris  plant  [14]. 

Figure 9.1 reproduces the original measurements, performed by E. Anderson,
of the petal length and the sepal length of 150 iris plants of the species
Iris setosa, Iris versicolor, and Iris virginica. The measurements are in
millimeters (mm). Fisher's aim was to find a linear combination of the four
characteristics that would be best suited to identify one species from the
others. It is already clear from the data in Fig. 9.1 that one of the quantities
(e.g., the sepal length) may be used as a discriminator among the three species.
R.A. Fisher used this dataset to find a linear combination of the four quantities
that would improve the classification of irises.

The dataset is a classic example of a multi-variate dataset, in which several
variables  are  measured  simultaneously  and  independently.  In  addition  to 
Fisher’s  original  purpose,  these  data  can  also  be  used  to  determine  whether 
one  of  the  characteristics,  e.g.,  the  sepal  length,  can  be  efficiently  predicted 
based  on  any  (or  all)  of  the  other  characteristics.  For  example,  one  could 
expect  that  the  length  of  the  sepal  (which  is  part  of  the  calyx  of  the  flower)  is 
related  linearly  to  its  width,  or  to  the  length  of  the  petal.  Assuming  a  linear 
relationship  among  the  variables,  we  set 

SL  =  a  +  bSW  +  cPL  +  dPW  (9.1) 

where  a ,  b ,  c,  and  d  are  coefficients  that  we  can  estimate  from  the  data  using 
the  method  described  in  Sect.  9.3. 


Throughout this chapter we use these data to study the linear regression of
(9.1)  for  the  species  Iris  setosa.  We  will  find  that  the  most  important  variable 
needed  to  predict  the  sepal  length  is  the  sepal  width,  while  the  measurements 
of  characteristics  of  petals  are  not  very  important  in  predicting  the  sepal 
length. 

Fig. 9.1 Measurements of three iris species from the 1936 R.A. Fisher paper [14].
Measurements are in mm


9.3 The Multi-Variable Linear Regression

Consider a dataset of N measurements of m + 1 variables which we call $Y, X_1, \dots, X_m$.
We can use the index i to indicate the measurement, $i = 1, \dots, N$, and the index
k for the variables $X_k$, $k = 1, \dots, m$. Each set of measurements is therefore indicated
as $(y_i \pm \sigma_i, x_{1i}, \dots, x_{mi})$.

We write the variable Y as a linear function of the m variables $X_k$,

$$
y(x) = a_0 + a_1 x_1 + \dots + a_m x_m = a_0 + \sum_{k=1}^{m} a_k x_k. \tag{9.2}
$$


The goal is to find the values for the m + 1 coefficients $a_k$, $k = 0, \dots, m$ that
minimize the $\chi^2$ function

$$
\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - y(x_i)}{\sigma_i} \right)^2. \tag{9.3}
$$


The quantity $y(x_i) = a_0 + a_1 x_{1i} + \dots + a_m x_{mi}$ is the value of y(x) calculated for
the i-th set of measurements of the $X_k$'s. The coefficient $a_0$ is an overall offset,
equivalent to the constant a for the two-dimensional linear regression function
y = a + bx.

This form for the $\chi^2$ function is the same as that used for the multiple linear
regression of Sect. 8.4. The only change is that the measurements $x_{ki}$ take the place
of the functions $f_k(x_i)$. The quantity $\sigma_i$ is interpreted as the error in the variable Y,
which is the dependent quantity in this regression. As in the case of the two-variable
dataset, we ignore the errors in the variables $X_k$ (see Chap. 12 for an extension of
the two-variable dataset regression with errors in both variables). When the
multi-variable dataset has no errors, or if we choose to ignore the errors in the Y variable
as well, we can omit the term $\sigma_i$ in (9.3). This corresponds to assuming a uniform
error for all measurements.

The similarity in form between the $\chi^2$ functions to minimize for the present
multi-variable linear regression and the multiple regression of Sect. 8.4 means that
we have already at hand a solution for the coefficients of the regression and their
errors. We need to make the following substitutions:

$$
\begin{cases}
f_1(x) = 1 = x_0 & \text{(thus the } x_{0i}\text{'s are not needed)} \\
f_{k+1}(x) = x_k, & k = 1, \dots, m
\end{cases}
\tag{9.4}
$$

and use the solution from Sect. 8.4 with m + 1 terms. The best-fit parameters $a_k$ can
be found via the matrix equation

$$
a = \beta A^{-1}, \tag{9.5}
$$
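where the row vectors $\beta$ and a and the (m+1) x (m+1) symmetric matrix A are
given by

$$
\begin{cases}
a = (a_0, a_1, \dots, a_m) \\[1ex]
\beta = (\beta_0, \beta_1, \dots, \beta_m) \quad \text{in which} \quad \beta_k = \displaystyle\sum_{i=1}^{N} \frac{x_{ki}\, y_i}{\sigma_i^2} \\[2ex]
A_{lk} = \displaystyle\sum_{i=1}^{N} \frac{x_{li}\, x_{ki}}{\sigma_i^2} \quad (l, k \text{ component of } A),
\end{cases}
$$

with $x_{0i} = 1$ for all i.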


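The errors and covariances among parameters are likewise given by the error matrix
$\epsilon = A^{-1}$. Assuming a constant value for the variance $\sigma^2$ (i.e., uniform measurement
errors), the matrix A and the vector $\beta$ can be written in extended form as

$$
A = \frac{1}{\sigma^2}
\begin{pmatrix}
N & \sum x_{1i} & \cdots & \sum x_{mi} \\
\sum x_{1i} & \sum x_{1i}^2 & \cdots & \sum x_{1i} x_{mi} \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_{mi} & \sum x_{mi} x_{1i} & \cdots & \sum x_{mi}^2
\end{pmatrix}
\tag{9.6}
$$

$$
\beta = \frac{1}{\sigma^2} \left( \sum y_i, \; \sum x_{1i} y_i, \; \dots, \; \sum x_{mi} y_i \right)
\tag{9.7}
$$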


where all sums are over the N measurements. An estimate for the variance $\sigma^2$ is
given by

$$
s^2 = \frac{1}{N - m - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\tag{9.8}
$$

where $\hat{y}_i = a_0 + a_1 x_{1i} + \dots + a_m x_{mi}$ is calculated for the best-fit values of the
coefficients $a_k$.
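As a consistency check, and as a worked special case that is not part of the original text, the matrix solution can be written out for m = 1 with uniform errors. Writing $x_i$ for $x_{1i}$ and introducing the shorthand $\Delta$ (an assumed symbol) for the determinant, the factors of $\sigma^2$ cancel in the product $\beta A^{-1}$:

```latex
\begin{aligned}
A &= \frac{1}{\sigma^2}
\begin{pmatrix} N & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix},
\qquad
\beta = \frac{1}{\sigma^2} \left( \sum y_i, \; \sum x_i y_i \right), \\[1ex]
A^{-1} &= \frac{\sigma^2}{\Delta}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & N \end{pmatrix},
\qquad
\Delta = N \sum x_i^2 - \Big( \sum x_i \Big)^2, \\[1ex]
a_0 &= \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{\Delta},
\qquad
a_1 = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{\Delta}.
\end{aligned}
```

These are the familiar least-squares expressions for the intercept and slope of the two-variable linear regression y = a + bx, with $a_0$ and $a_1$ playing the roles of a and b.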

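An alternative notation for finding the coefficients $a_k$ makes use of the following
definitions:

$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix},
\quad
X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1m} \\
1 & x_{21} & \cdots & x_{2m} \\
\vdots & \vdots & & \vdots \\
1 & x_{N1} & \cdots & x_{Nm}
\end{pmatrix},
\quad \text{and} \quad
a = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix}
\tag{9.9}
$$

where X is called the design matrix and we have arranged the Y measurements and
the vector of coefficients in column vectors. In the elements of X, the first index labels
the measurement and the second labels the variable. With this notation, the least-squares
approach gives the following solution for the coefficients [41]:

$$
a = (X^T X)^{-1} X^T Y.
\tag{9.10}
$$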
It is easy to show that (9.5) and (9.10) are equivalent (see Problem 9.3). Using this
notation, the error matrix is given by

$$
\epsilon = s^2 (X^T X)^{-1}. \tag{9.11}
$$

We therefore have two equivalent methods to calculate the coefficients of the
multiple regression and their errors. The latter form, based on (9.9), may be convenient if the
data are already tabulated according to the form of the matrix X, so that a can be
found using the matrix algebra of (9.10). The drawback is that the design matrix can
be of very large size, N x (m + 1), where N is the number of measurements. The
form of (9.5) is more compact, since the matrix A is (m + 1) x (m + 1), but the
summation over the N measurements must be performed beforehand to obtain A.
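The two equivalent routes to the coefficients can be compared on a small dataset. The sketch below assumes NumPy is available and uses invented data with m = 2 independent variables; the variable names are illustrative. With uniform errors, the factor $1/\sigma^2$ is common to A and $\beta$ and cancels in the solution, so it is omitted.

```python
import numpy as np

# Hypothetical data: N = 6 measurements of m = 2 independent variables.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = np.array([1.1, 2.9, 4.2, 6.8, 8.1, 10.0])

N, m = len(y), 2
X = np.column_stack([np.ones(N), x1, x2])    # design matrix, N x (m + 1)

# Route 1: least-squares solution of (9.10) from the design matrix.
a_design, *_ = np.linalg.lstsq(X, y, rcond=None)

# Route 2: sums over the measurements as in (9.6) and (9.7), uniform errors.
A = X.T @ X                                  # (m + 1) x (m + 1) matrix of sums
beta = X.T @ y                               # vector of sums
a_sums = np.linalg.solve(A, beta)            # same as beta A^-1, since A is symmetric

# Variance estimate (9.8) and error matrix (9.11).
residuals = y - X @ a_design
s2 = residuals @ residuals / (N - m - 1)
error_matrix = s2 * np.linalg.inv(A)
parameter_errors = np.sqrt(np.diag(error_matrix))

print("coefficients:", a_design)
print("errors:      ", parameter_errors)
```

Using `lstsq` or `solve` rather than forming the inverse explicitly for the coefficients is a numerical convenience; the explicit inverse is needed only for the error matrix.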


9.4  Tests  for  Significance  of  the  Multiple  Regression 
Coefficients 

The multi-variable linear regression model of (9.2) is specified by the m + 1
coefficients $a_k$. After determining their best-fit values and errors, it is necessary to
establish whether the model is an accurate representation of the data and whether
there are any independent variables that do not provide a significant contribution
to the prediction of the Y variable. Both tasks can be performed using hypothesis
testing on the relevant statistic. We discuss these tests of significance using
Fisher's data of Sect. 9.2.


9.4.1  T-Test  for  the  Significance  of  Model  Components 

It is necessary to test the significance of each of the m + 1 parameters of the
multi-variable linear regression. The null hypothesis is that their true value is zero, i.e., the
corresponding variable is not needed in the model. For this purpose, we show that
the ratio of the parameter's best-fit value $a_k$ and its standard deviation $s_k$,

$$
t_k = \frac{a_k}{s_k} \tag{9.12}
$$

is distributed like a Student's t distribution with N - m - 1 degrees of freedom.

Proof Following the derivation provided in Sect. 7.5.1 for the sample mean,
we can write

$$
t_k = \frac{(a_k - \mu_k)/\sigma_k}{s_k/\sigma_k}
\tag{9.13}
$$
where $\mu_k = 0$ is the null hypothesis and $\sigma_k^2$ is the unknown parent variance for
the parameter. Recall that the sample variance of the parameter $s_k^2$ is obtained
as a product of the diagonal term in the error matrix and the estimate of the
data variance $s^2$. Accordingly we set

$$
\frac{(N - m - 1)\, s_k^2}{\sigma_k^2} = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sigma^2} \sim \chi^2(N - m - 1),
\tag{9.14}
$$


i.e., the denominator of $t_k$ can be written as a function of a variable that is
$\chi^2$-distributed. It is also clear that, under the null hypothesis, $\mu_k = 0$ is the parent
value of $a_k$, and therefore the numerator of $t_k$ is distributed like a standard
normal distribution.

It follows that $t_k$ is distributed like a t distribution,

$$
t_k \sim \frac{N(0, 1)}{\sqrt{\chi^2(N - m - 1)/(N - m - 1)}} \sim t(N - m - 1)
\tag{9.15}
$$

according to the definition of the t distribution of (7.33). □

To test for the significance of coefficient $a_k$ we therefore use the critical value for
the t distribution for the appropriate number of degrees of freedom and the desired
confidence level.


Example 9.1 (Multi-Variable Linear Regression on Iris setosa Data) The data of
Fig. 9.1 for the Iris setosa species are fit to the linear model of (9.1), where the sepal
length is used as the Y variable and the remaining three variables are the independent
variables. Using (9.5) and the inverse of matrix A for the errors, we find the results
shown in Table 9.1, including the t scores for the four parameters of the multiple
regression.

For each parameter, the table reports the probability to exceed the absolute value of
the measured t according to a t distribution with f = 46 degrees of freedom, where
f = N - m - 1 with N = 50 measurements and m = 3 independent variables. It
is clear that the parameters $a_2$ and $a_3$, corresponding to the petal length and width,
are not significant because of the large probability p to exceed their value under the
null hypothesis. Accordingly, it would be meaningful to repeat the linear regression
using only the sepal width as an estimator for the sepal length. ◊


Table 9.1 Multiple regression parameters for the Iris setosa data

    Parameter   Best-fit value   Error   t score   p value
    a_0         2.352            0.393   5.99      < 0.001
    a_1         0.655            0.092   7.08      < 0.001
    a_2         0.238            0.208   1.14      0.26
    a_3         0.252            0.347   0.73      0.47
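The t scores and p values of Table 9.1 can be reproduced from the tabulated best-fit values and errors. The sketch below uses only the Python standard library and integrates the Student's t density numerically with Simpson's rule; this is a choice made for self-containment, not the method used in the text, and the function names are hypothetical. Small differences from the table reflect the rounding of the tabulated values.

```python
import math


def student_t_pdf(t, f):
    """Density of the Student's t distribution with f degrees of freedom."""
    log_norm = (math.lgamma((f + 1) / 2) - math.lgamma(f / 2)
                - 0.5 * math.log(f * math.pi))
    return math.exp(log_norm - (f + 1) / 2 * math.log(1 + t * t / f))


def two_sided_p_value(t, f, steps=2000):
    """P(|T| > |t|), from a Simpson integration of the density over (0, |t|)."""
    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = student_t_pdf(0.0, f) + student_t_pdf(upper, f)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * student_t_pdf(j * h, f)
    central = 2 * total * h / 3            # probability of -|t| < T < |t|
    return 1 - central


# Best-fit values and errors copied from Table 9.1.
table = {"a0": (2.352, 0.393), "a1": (0.655, 0.092),
         "a2": (0.238, 0.208), "a3": (0.252, 0.347)}
dof = 50 - 3 - 1                           # f = N - m - 1 = 46

results = {}
for name, (value, error) in table.items():
    t_score = value / error
    results[name] = (t_score, two_sided_p_value(t_score, dof))
    print(f"{name}: t = {t_score:.2f}, p = {results[name][1]:.2g}")
```

A statistics library that provides the t distribution would give the same probabilities directly.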

9.4.2 F-Test for Goodness of Fit

The purpose of the multi-variable linear model is to provide a fit to the data
that is more accurate than a simple constant predictor, i.e., the average of the Y
measurements. In other words, we want to establish whether any of the parameters
$a_1, \dots, a_m$ provides a significant improvement over the constant model with
$a_1 = a_2 = \dots = a_m = 0$.

For this purpose we write the total variance of the data as follows:

$$
\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2
\tag{9.16}
$$

where $\hat{y}_i = y(x_i)$ is evaluated for the best-fit values of the parameters $a_k$. This
equation can be shown to hold because the following property applies,

$$
\sum_{i=1}^{N} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0
\tag{9.17}
$$

(see Problem 9.7). The parent variance $\sigma^2$ of the data is unknown and it is not
required for this test. We therefore ignore it for the considerations that follow by
setting $\sigma^2 = 1$. The three terms in (9.16) are interpreted as follows. The left-hand
side term is the total variance of the data and it is distributed like

$$
S^2 = \sum_{i=1}^{N} (y_i - \bar{y})^2 \sim \chi^2(N - 1).
\tag{9.18}
$$

The total variance $S^2$ can be interpreted as the variance obtained using a model
with $a_1 = \dots = a_m = 0$, i.e., a constant model equal to the average of the Y
measurements.

The first term on the right-hand side is the residual variance after the data are fit
to the linear model and it follows the usual $\chi^2$ distribution

$$
S_r^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \sim \chi^2(N - m - 1)
\tag{9.19}
$$

because of the m + 1 parameters used in the fit. This is the usual variance obtained
using the full model in which at least some of the $a_k$ parameters are not equal to
zero.

Finally, the second term on the right-hand side can be interpreted as the variance
explained by the best-fit model and it is distributed like

$$
S_e^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 \sim \chi^2(m).
\tag{9.20}
$$


The distribution of the last term can be explained by the independence between the
two variables on the right-hand side of the equation and the distribution of the
left-hand side term, following a derivation similar to that of Sect. 7.3. Such a derivation is
not discussed in this book.
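The decomposition (9.16) and the orthogonality property (9.17) can be verified numerically on any least-squares fit. The sketch below does so for a straight-line fit (m = 1) to invented data; the dataset and variable names are assumptions made for illustration only.

```python
# Hypothetical data for a straight-line fit, y = a + b*x (m = 1).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
N, m = len(x), 1

x_mean = sum(x) / N
y_mean = sum(y) / N

# Least-squares slope and intercept for uniform errors.
b = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
     / sum((xi - x_mean) ** 2 for xi in x))
a = y_mean - b * x_mean
y_fit = [a + b * xi for xi in x]

# The three terms of (9.16) and the cross term of (9.17).
S2_total = sum((yi - y_mean) ** 2 for yi in y)
S2_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
S2_explained = sum((fi - y_mean) ** 2 for fi in y_fit)
cross_term = sum((yi - fi) * (fi - y_mean) for yi, fi in zip(y, y_fit))

print(f"total = {S2_total:.4f}, residual + explained = {S2_residual + S2_explained:.4f}")
print(f"cross term = {cross_term:.2e}")
```

The cross term vanishes to within rounding errors, so the total variance splits exactly into the residual and explained parts.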

The variances described above can be used to define the variable

$$
F = \frac{S_e^2/m}{S_r^2/(N - m - 1)} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 / m}{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 / (N - m - 1)}
\tag{9.21}
$$

which is distributed as an F variable with m, N - m - 1 degrees of freedom under the
null hypothesis that $a_1 = \dots = a_m = 0$. The meaning of this variable is the ratio
between the variance explained by the fit and the residual variance, each normalized
by the respective degrees of freedom. A large value of this ratio is desirable, since it
means that the model does a good job at explaining the variability of the data.

The measurement of F that results from the fit of a dataset to the multi-variable
linear model can therefore be used to test the null hypothesis. If the measurement
exceeds the critical value of the F distribution for the desired confidence level, the
null hypothesis must be rejected and the linear model is considered acceptable.

Example 9.2 (F-Test of Iris setosa Data) The variances for the Iris setosa data are
shown in Table 9.2. The variable F is

$$
F = \frac{3.50/3}{2.59/46} = 20.76.
\tag{9.22}
$$

The 99% (p = 0.01) critical value for an F distribution with 3, 46 degrees of
freedom is 4.24. Therefore the null hypothesis that the linear model does not provide
a significant improvement must be rejected. In practice, this means that the linear
model is warranted. The probability to exceed the measured value of 20.76 for the
test statistic is $1.2 \times 10^{-8}$, i.e., very small. ◊


Table 9.2 Variances and F-test results for the Iris setosa data

    Variance   Value   d.o.f.
    S^2        6.09    N - 1 = 49
    S_r^2      2.59    N - m - 1 = 46
    S_e^2      3.50    m = 3

    F-test                             Value   p value
    (S_e^2/m) / (S_r^2/(N - m - 1))    20.76   1.2 x 10^-8
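The F statistic of this example follows directly from the variances of Table 9.2. The short sketch below repeats the arithmetic with m = 3 and N = 50 and compares the result with the 99% critical value of 4.24 quoted in the text; the variable names are illustrative, and the small difference from 20.76 is due to the rounding of the tabulated variances.

```python
# Variances copied from Table 9.2 (rounded to two decimals in the table).
S2_total, S2_residual, S2_explained = 6.09, 2.59, 3.50
N, m = 50, 3

# F statistic of (9.21): explained and residual variances per degree of freedom.
F = (S2_explained / m) / (S2_residual / (N - m - 1))

critical_value_99 = 4.24          # quoted in the text for 3, 46 degrees of freedom
reject_null = F > critical_value_99

print(f"F = {F:.2f}, null hypothesis rejected: {reject_null}")
```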


9.4.3  The  Coefficient  of  Determination 


The ratio of the explained variance $S_e^2$ to the total variance $S^2$, defined as

$$
R^2 = \frac{S_e^2}{S^2} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
\tag{9.23}
$$

is a common measure of the ability of the linear model to describe the data. This
ratio is called the coefficient of (multiple) determination and it is $0 \le R^2 \le 1$. A
value close to 1 indicates that the model describes the data with little additional
variance left unexplained.

It is possible to relate the coefficient $R^2$ to the F-test variable defined in (9.21)
and obtain an equivalent test for the multi-variable regression based on $R^2$ instead
of the F variable (see, e.g., [41] and [29]). Since the two quantities are related, it
is sufficient to test the overall multiple regression using the F test provided in the
previous section. The advantage of reporting explicitly a value for $R^2$ is that we
can identify in a simple way the amount of variance that remains in the data after
performing the multiple regression.

Example 9.3 ($R^2$ Value for the Iris setosa Data) We can use the data in Table 9.2
to calculate a coefficient of multiple determination $R^2 = 0.575$. This number
means that 57.5% of the total data variance is explained by the best-fit regression
model. ◊
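The relation between $R^2$ and the F variable mentioned above can be made explicit. The following short derivation is a sketch that uses only (9.16), (9.21), and (9.23): since $S_e^2 = R^2 S^2$ and $S_r^2 = S^2 - S_e^2 = (1 - R^2) S^2$, the total variance cancels in the ratio.

```latex
\begin{aligned}
F &= \frac{S_e^2 / m}{S_r^2 / (N - m - 1)}
   = \frac{R^2 / m}{(1 - R^2) / (N - m - 1)}, \\[1ex]
F &\simeq \frac{0.575 / 3}{0.425 / 46} \simeq 20.7
   \quad \text{for } R^2 = 0.575,\; m = 3,\; N = 50.
\end{aligned}
```

The numerical value agrees with the F statistic of Example 9.2 to within the rounding of $R^2$.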

In the case of the simple linear regression with just one independent variable,
y = a + bx, the coefficient of determination $R^2$ is equal to the square of the coefficient of linear
correlation r defined earlier in (2.19) (see Problem 9.4). In this case it is possible to
test the significance of the linear model using either the correlation coefficient r or
the F test. The two tests will be equivalent.

Example 9.4 (Linear Fit to the Iris setosa Data Using a Single Independent
Variable) In a previous example we have shown that the coefficients of multiple
regression for the variables Petal Length and Petal Width were not statistically
significant, according to the t test.

Excluding these two columns of data, a fit to the function y = a + bx, where
Y is the Sepal Length and X the Sepal Width, can be shown to return the values
a = 2.64 ± 0.31 and b = 0.69 ± 0.09 with a correlation coefficient of r = 0.7425
or a value of F = 58.99 for 1, 48 degrees of freedom (see Problem 9.5). The value
of $r^2 = 0.551$ is very similar to that obtained from the full fit using the additional
two variables. The fact that the reduction in $r^2$ is minimal between the m = 3 and
the m = 1 case is an indication that the Sepal Length can be predicted with nearly
the same precision using just the Sepal Width as an indicator. ◊
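The numbers quoted in Example 9.4 can be cross-checked against each other. For a single independent variable (m = 1), the F statistic can be written in terms of the correlation coefficient as $F = r^2 (N - 2)/(1 - r^2)$, with N - 2 = 48 degrees of freedom in the denominator for N = 50. The sketch below is only an arithmetic check under that assumption; the variable names are illustrative.

```python
# Values quoted in Example 9.4.
r = 0.7425
N, m = 50, 1

r_squared = r ** 2
# F statistic for m = 1 expressed through the correlation coefficient.
F = (r_squared / m) / ((1.0 - r_squared) / (N - m - 1))

print(f"r^2 = {r_squared:.3f}, F = {F:.2f} with {m}, {N - m - 1} degrees of freedom")
```

The result reproduces both $r^2 = 0.551$ and the quoted F value to within rounding.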

Summary  of  Key  Concepts  for  this  Chapter 

□ Multi-variable dataset: Simultaneous measurements of several (> 2)
variables, usually without reference to a specific independent variable.

□ Multi-variable linear regression: Extension of the (multiple) linear
regression to the case of multi-variable data. Best-fit coefficients are given by the
matrix equation

$$
a = (X^T X)^{-1} X^T Y.
$$

□ Coefficient of determination: The ratio between the explained variance and
total variance, $R^2 = S_e^2/S^2 \le 1$.


Problems 

9.1 Calculate the best-fit parameters and uncertainties for the multi-variable
regression of the Iris setosa data of Fig. 9.1.

9.2 Use an F test to determine whether the multi-variable regression of the Iris
setosa data is justified or not.

9.3 Prove that (9.5) and (9.10) are equivalent. Take into consideration that in (9.5)
the vectors a and $\beta$ are row vectors. You may re-write (9.5) using column vectors.

9.4 Prove that the coefficient of determination $R^2$ for the simple linear regression
y = a + bx is equal to the square of the sample correlation coefficient of (2.20).

9.5 Fit the Iris setosa data using the function y = a + bx, where Y is the Sepal
Length and X the Sepal Width. For this fit, you will ignore the data associated with
the petal. Determine the best-fit parameters of the linear model and their errors.

9.6  Using  the  results  of  Problem  9.5,  determine  whether  there  is  sufficient  evidence 
for  the  use  of  the  simple  y  =  a  +  bx  model  for  the  data.  Use  a  confidence  level  of 
99  %  to  draw  your  conclusions. 


9.7  Prove  (9.17). 


Chapter  10 

Goodness  of  Fit  and  Parameter  Uncertainty 


Abstract After calculating the best-fit values of model parameters, it is necessary
to determine whether the model is actually a correct description of the data, even
when we use the best possible values for the free parameters. In fact, only when
the model is acceptable are best-fit parameters meaningful. The acceptability of a
model is typically addressed via the distribution of the $\chi^2$ statistic or, in the case of
Poisson data, of the Cash statistic. A related problem is the estimate of uncertainty
in the best-fit parameters. This chapter describes how to derive confidence intervals
for fit parameters, in the general case of Gaussian distributions that require $\chi^2$
minimization, and for the case of Poisson data requiring the Cash statistic. We
also study whether a linear relationship between two variables is warranted at all,
providing a statistical test based on the linear correlation coefficient. This is a
question that should be asked of a two-variable dataset prior to any attempt to fit
with a linear or more sophisticated model.


10.1 Goodness of Fit for the $\chi^2_{min}$ Fit Statistic


For both linear and nonlinear Gaussian fits, one needs to establish if the set of best-fit
parameters that minimizes $\chi^2$ is acceptable, i.e., if the fit was successful. For this
purpose, we need to perform a hypothesis test based on the minimum of the $\chi^2$
statistic that was obtained for the given model. According to its definition,

$$\chi^2_{min} = \sum_{i=1}^{N} \frac{\left(y_i - y(x_i)|_{best-fit}\right)^2}{\sigma_i^2} \qquad (10.1)$$

in which $y(x_i)|_{best-fit}$ is the model calculated with the best-fit parameters. It
is tempting to say that the $\chi^2_{min}$ statistic is distributed like a $\chi^2$ random variable
(Sect. 7.2), since it is the sum of N random variables, each assumed to be
distributed like a standard normal. If the function $y(x)$ has no free parameters, this is
certainly the case, and it would also be clear that $\chi^2$ has N degrees of freedom.

The complication is that the fit function has m free parameters that were adjusted
in such a way as to minimize the $\chi^2$. This has two implications for the $\chi^2_{min}$ statistic:
the free parameters will reduce the value of $\chi^2$ with respect to the case in which no
free parameters were present, and, more importantly, the fit function $y(x)$ introduces
a dependence among the N random variables in the sum. Given that $\chi^2_{min}$ is no
longer the sum of N independent terms, we cannot conclude that $\chi^2_{min} \sim \chi^2(N)$.

It can be shown that $\chi^2_{min}$ is in fact still distributed as a $\chi^2$ variable, but with

$$f = N - m \qquad (10.2)$$


degrees of freedom. This result applies to any type of function $y(x)$, under the
assumption that the m parameters are independent of one another, as is normally
the case for “meaningful” fit functions. The general proof of this statement is rather
elaborate, and can be found in the textbook by Cramer [11]. Here we limit ourselves
to providing a proof for the specific case in which $y(x) = a$, meaning a one-parameter
fit function that is a constant, to illustrate the reduction of degrees of freedom from
N to N - 1 when there is just one free parameter that can be used to minimize $\chi^2$.

Proof When performing a maximum likelihood fit to the function $y(x) = a$,
we have shown that the best-fit parameter is estimated as

$$a = \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i,$$

under the assumption that all measurements are drawn from the same distribution
$N(\mu, \sigma)$ (see Sect. 5.1). Therefore, we can write

$$\chi^2 = \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma^2} = \frac{(\bar{x} - \mu)^2}{\sigma^2/N} + \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(\bar{x} - \mu)^2}{\sigma^2/N} + \chi^2_{min}.$$

This equation is identical to the relationship used to derive the sampling
distribution of the variance, (7.19), and therefore we can directly conclude
that $\chi^2_{min} \sim \chi^2(N - 1)$ and that $\chi^2_{min}$ and $\chi^2$ are independent random variables.
Both properties will be essential for the calculation of confidence intervals on
fit parameters. □
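The loss of one degree of freedom can also be checked numerically. The following is a minimal sketch, not part of the original text: it simulates many datasets of N = 10 Gaussian measurements and compares the average of $\chi^2_{min}$ (constant model fitted to each dataset) with the average of the $\chi^2$ calculated with the true mean. The function names, the parent mean and standard deviation, and the number of trials are illustrative assumptions.

```python
import random

def chi2_min_constant(y, sigma):
    """Minimum chi^2 for the one-parameter model y(x) = a with equal errors."""
    a_best = sum(y) / len(y)  # the best-fit constant is the sample mean
    return sum(((yi - a_best) / sigma) ** 2 for yi in y)

def simulate(n_points=10, mu=5.0, sigma=2.0, n_trials=20000, seed=1):
    """Average chi^2_min and chi^2 (true mean) over many simulated datasets."""
    rng = random.Random(seed)
    sum_min = sum_true = 0.0
    for _ in range(n_trials):
        y = [rng.gauss(mu, sigma) for _ in range(n_points)]
        sum_min += chi2_min_constant(y, sigma)
        sum_true += sum(((yi - mu) / sigma) ** 2 for yi in y)
    return sum_min / n_trials, sum_true / n_trials

mean_min, mean_true = simulate()
print(f"mean chi2_min             = {mean_min:.2f}  (N - 1 = 9 expected)")
print(f"mean chi2 with true mean  = {mean_true:.2f}  (N = 10 expected)")
```

Since the mean of a $\chi^2(f)$ variable is f, the two averages should be close to N - 1 and N, respectively, within the Monte Carlo noise.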

Now that the distribution function of the fit statistic $\chi^2_{min}$ is known, we can use
the hypothesis testing methods of Sect. 7.2.3 to determine whether a value of the
statistic is acceptable or not. The null hypothesis that the data are well fit by, or
compatible with, the model, can be rejected at a confidence level p according to a
one-tailed test defined by

$$1 - p = \int_{\chi^2_{crit}}^{\infty} f_{\chi^2}(f, x)\, dx = P(\chi^2(f) \geq \chi^2_{crit}). \qquad (10.3)$$

The value $\chi^2_{crit}$ calculated from (10.3) for a specified value of p defines the rejection
region $\chi^2_{min} \geq \chi^2_{crit}$. The data analyst must choose a value of p, say p = 0.9, and
calculate the critical value $\chi^2_{crit}$ that satisfies (10.3), using Table A.7. If the $\chi^2$
value measured from the data is higher than the value calculated from (10.3), then the
hypothesis should be rejected at the p, say 90 %, confidence level. On the other hand,
if the $\chi^2$ value measured from the data is lower than this critical value, the hypothesis
should not be rejected, and the fit considered as consistent with the model or, more
precisely, not rejected, at that confidence level.

Example 10.1 Figure 10.1 shows a linear fit using data from Table 6.1. The
quantity Energy 1 is used as the independent variable, and its errors are neglected.
The quantity Energy 2 is the dependent variable, and errors are calculated as
the average of the positive and negative error bars. The best-fit linear model is
represented as the dotted line, for a fit statistic of $\chi^2_{min} = 60.5$ for 23 degrees of
freedom. The value of the fit statistic is too large, and the linear model must be
discarded (see Appendix A.3 for critical values of the $\chi^2$ distribution).

Despite failing the $\chi^2_{min}$ test, the best-fit model appears to be a reasonable match
to the data. The large value of the test statistic is clearly caused by a few datapoints
with small error bars, but there appears to be no systematic deviation from the linear
model. One reason for the poor fit statistic could be that errors in the independent
variable were neglected. In Chap. 12 we explain an alternative fitting method that
takes into account errors in both variables. Another possibility for the poor fit is
that there are other sources of error that are not accounted for. These additional errors
are often referred to as systematic errors. In Chap. 11 we address the presence of
systematic errors and how one can handle the presence of such errors in the fit. ◊
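The test of (10.3) can be sketched in a few lines of code, as an alternative to reading a table. This is an illustrative sketch that uses only the Python standard library: the helper names `chi2_cdf` and `chi2_critical` are assumptions of this example, the cumulative distribution is evaluated with the standard series for the regularized incomplete gamma function, and the critical value is found by bisection. The numbers 60.5 and 23 are those quoted in Example 10.1.

```python
import math

def chi2_cdf(x, f):
    """P(chi^2(f) <= x), via the series for the regularized incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = f / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_critical(p, f):
    """Value x such that P(chi^2(f) <= x) = p, found by bisection."""
    lo, hi = 0.0, 10.0 * f + 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, f) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

chi2_min, dof = 60.5, 23  # values quoted in Example 10.1
crit = chi2_critical(0.90, dof)
p_exceed = 1.0 - chi2_cdf(chi2_min, dof)
print(f"critical value at p = 0.9 for f = {dof}: {crit:.2f}")
print(f"P(chi2({dof}) >= {chi2_min}) = {p_exceed:.1e}")
print("reject the model" if chi2_min > crit else "do not reject the model")
```

The same two helpers apply to any number of degrees of freedom and confidence level, so they can stand in for the table lookup when a tabulated entry is not available.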


Fig. 10.1 Linear fit to the data of Table 6.1. We assumed that the independent variable is Energy 1; errors for this variable were neglected in the fit. Note the logarithmic scale for both axes

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10.2  Goodness  of  Fit  for  the  Cash  C  Statistic 


In the case of a Poisson dataset (Sect. 8.8) the procedure to determine whether the
best-fit model is acceptable is identical to that for Gaussian data, provided that the
$\chi^2$ fit statistic is replaced by the Cash statistic C, defined by

$$C = -2 \ln \mathcal{L} - B,$$

where

$$B = 2 \sum_{i=1}^{N} y_i - 2 \sum_{i=1}^{N} y_i \ln y_i + 2 \sum_{i=1}^{N} \ln y_i!.$$

We now prove that C is approximately distributed like a $\chi^2$ distribution with
N - m degrees of freedom. This is an important result that lets us use the Cash
statistic C in the same way as the $\chi^2$ statistic.

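Proof We start with

$$-2 \ln \mathcal{L} = -2 \left( \sum_{i=1}^{N} y_i \ln y(x_i) - \sum_{i=1}^{N} y(x_i) - \sum_{i=1}^{N} \ln y_i! \right)$$

and rewrite as

$$-2 \ln \mathcal{L} = 2 \sum_{i=1}^{N} \left( y(x_i) - y_i \ln \frac{y(x_i)}{y_i} - y_i \ln y_i + \ln y_i! \right).$$

In order to find an expression that asymptotically relates C to $\chi^2$, define $d = y_i - y(x_i)$
as the “average” deviation of the measurement from the parent mean.
It is reasonable to expect that

$$\frac{d}{y_i} \simeq \frac{1}{\sqrt{y_i}}, \qquad \frac{y(x_i)}{y_i} = \frac{y_i - d}{y_i} = 1 - \frac{d}{y_i},$$

where $y_i$ is the number of counts in that specific bin. It follows that

$$-2 \ln \mathcal{L} = 2 \sum_{i=1}^{N} \left( y(x_i) - y_i \ln\left(1 - \frac{d}{y_i}\right) - y_i \ln y_i + \ln y_i! \right)$$

$$\simeq 2 \sum_{i=1}^{N} \left( y(x_i) - y_i \left( -\frac{d}{y_i} - \frac{1}{2} \left(\frac{d}{y_i}\right)^2 \right) - y_i \ln y_i + \ln y_i! \right)$$

$$= 2 \sum_{i=1}^{N} \left( y(x_i) + (y_i - y(x_i)) + \frac{1}{2} \frac{(y_i - y(x_i))^2}{y_i} - y_i \ln y_i + \ln y_i! \right).$$

The quadratic term can now be written in such a way that the denominator
carries the term $y(x_i) = \sigma^2$:

$$\frac{(y_i - y(x_i))^2}{y_i} = \frac{(y_i - y(x_i))^2}{d + y(x_i)} = \frac{(y_i - y(x_i))^2}{y(x_i)} \left( \frac{d}{y(x_i)} + \frac{y(x_i)}{y(x_i)} \right)^{-1} \simeq \frac{(y_i - y(x_i))^2}{y(x_i)} \left( 1 - \frac{d}{y(x_i)} \right).$$

We therefore conclude that

$$-2 \ln \mathcal{L} = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{y(x_i)} \left( 1 - \frac{d}{y(x_i)} \right) + \left( 2 \sum_{i=1}^{N} y_i - 2 \sum_{i=1}^{N} y_i \ln y_i + 2 \sum_{i=1}^{N} \ln y_i! \right),$$

showing that, within the multiplicative terms $(1 - d/y(x_i))$, the variable
$C = -2 \ln \mathcal{L} - B$ has a $\chi^2$ distribution with N - m degrees of freedom. □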

For the purpose of finding the best-fit parameters via minimization of the fit
statistic, the constant term B is irrelevant. However, in order to determine the
goodness of fit and confidence intervals, it is important to work with a statistic that
is distributed as a $\chi^2$ variable. Therefore the Cash statistic is defined as

$$C = -2 \sum_{i=1}^{N} y_i \ln \frac{y(x_i)}{y_i} + 2 \sum_{i=1}^{N} \left( y(x_i) - y_i \right). \qquad (10.4)$$


Example 10.2 Consider an ideal set of N = 10 identical measurements, $y_i = 1$.
For a fit to a constant model, $y = a$, it is clear that the best-fit model parameter must
be a = 1. Using the Cash statistic as redefined by (10.4), we find that C = 0, since
the data and the model match exactly. A similar result would be obtained if we had
assumed a Gaussian dataset of $y_i = 1$ and $\sigma_i = 1$, for which $\chi^2 = 0$. ◊
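A direct implementation of (10.4) makes the example easy to verify. The sketch below is illustrative only: the function names are assumptions, the second dataset is hypothetical, and the convention that a bin with zero counts contributes only the term $2 y(x_i)$ follows from the limit $y \ln y \to 0$.

```python
import math

def cash_statistic(counts, model):
    """Cash statistic of (10.4) for observed counts y_i and model values y(x_i) > 0."""
    c = 0.0
    for y, mu in zip(counts, model):
        c += 2.0 * (mu - y)
        if y > 0:  # the term y * ln(mu / y) vanishes in the limit y -> 0
            c -= 2.0 * y * math.log(mu / y)
    return c

def chi2_statistic(counts, model):
    """Gaussian analogue with variance sigma_i^2 = y(x_i), the Poisson variance."""
    return sum((y - mu) ** 2 / mu for y, mu in zip(counts, model))

# Example 10.2: ten identical measurements y_i = 1 and constant model a = 1.
print(cash_statistic([1] * 10, [1.0] * 10))

# Hypothetical bins with many counts, where C and chi^2 are expected to nearly agree.
counts = [95, 108, 101, 90, 112]
model = [100.0] * 5
print(cash_statistic(counts, model), chi2_statistic(counts, model))
```

For the large-count bins the two statistics differ only by the small correction terms $(1 - d/y(x_i))$ discussed in the proof above.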


10.3  Confidence  Intervals  of  Parameters  for  Gaussian  Data 

In this section we develop a method to calculate confidence intervals on model
parameters assuming a Gaussian dataset. The results will also be applicable to
Poisson data, provided that the $\chi^2$ statistic is replaced with the Cash C statistic
(see Sect. 10.4).

Under the assumption that a given model with m parameters is the correct
description of the data, the fit statistic $\chi^2$ calculated with these fixed true values,

$$\chi^2_{true} = \sum_{i=1}^{N} \frac{\left(y_i - y(x_i)|_{true}\right)^2}{\sigma_i^2} \qquad (10.5)$$

is distributed as $\chi^2(N)$, i.e., we expect random variations of the measurement of
$\chi^2_{true}$ according to a $\chi^2$ distribution with N degrees of freedom. This is so because the
true parameters are fixed and no minimization of the $\chi^2$ function can be performed.
The quantity $\chi^2_{true}$ is clearly only a mathematical construct, since the true values
of the parameters are unknown. One does not expect that $\chi^2_{true} = 0$, meaning a
perfect match between the data and the model. In fact, even if the model is correct,
statistical fluctuations will result in random deviations from the parent model.

On the other hand, when finding the best-fit parameters $a_i$, we calculate the
statistic:

$$\chi^2_{min} = \sum_{i=1}^{N} \frac{\left(y_i - y(x_i)|_{best-fit}\right)^2}{\sigma_i^2} \qquad (10.6)$$

which minimizes $\chi^2$ with respect to all possible free parameters. In this case, we
know that $\chi^2_{min} \sim \chi^2(N - m)$ from the discussion in Sect. 10.1. It is also clear that
the values of the best-fit parameters are not identical to the true parameters, again
because of the presence of random fluctuations of the datapoints.

After finding $\chi^2_{min}$, any change in the parameters (say, from $a_k$ to $a'_k$) will yield a
larger value of the test statistic, $\chi^2 > \chi^2_{min}$. We want to test whether the new set of
parameters $a'_k$ can be the true (yet unknown) values of the parameters, i.e., whether
the corresponding $\chi^2$ can be considered $\chi^2_{true}$. For this purpose we construct a new
statistic:

$$\Delta\chi^2 \equiv \chi^2 - \chi^2_{min} \qquad (10.7)$$

where $\chi^2$ is obtained for a given set of model parameters and, by definition, $\Delta\chi^2$ is
never negative. The hypothesis we want to test is that $\chi^2$ is distributed like $\chi^2_{true}$,
i.e., the $\chi^2$ calculated using a new set of parameters is consistent with $\chi^2_{true}$. Since
$\chi^2_{true}$ and $\chi^2_{min}$ are independent (see Sect. 10.1), we conclude that

$$\Delta\chi^2 \sim \chi^2(m) \qquad (10.8)$$

when $\chi^2 \sim \chi^2_{true}$. Equation (10.8) provides a quantitative way to determine
how much $\chi^2$ can increase, relative to $\chi^2_{min}$, with the value of $\chi^2$ still remaining
consistent with $\chi^2_{true}$. The method to use (10.8) for confidence intervals on the model
parameters is described below.

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10.3.1  Confidence  Interval  on  All  Parameters 

Equation (10.8) provides a quantitative method to estimate the confidence interval
on the m best-fit parameters. The value of $\Delta\chi^2$ is expected to follow the $\chi^2(m)$
distribution. This means that one can tolerate deviations from the best-fit values of
the parameters leading to an increase in $\chi^2$, provided such increase is consistent
with the critical value of the respective $\Delta\chi^2$ distribution. For example, in the case
of a model with m = 2 free parameters, a change $\Delta\chi^2 \geq 4.6$ marks the rejection
region at the p = 0.9 confidence level, and for a model with m = 1 parameter the
corresponding threshold is $\Delta\chi^2 \geq 2.7$ (see Table A.7).
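For m = 1 and m = 2 the critical values of $\Delta\chi^2$ have simple closed forms, which gives a quick cross-check of the tabulated numbers. The sketch below is an illustration under stated assumptions (the function names are hypothetical): for m = 1, $\chi^2(1)$ is the square of a standard Gaussian, so the critical value is the square of a two-sided Gaussian quantile; for m = 2, the cumulative distribution of $\chi^2(2)$ is $1 - e^{-x/2}$.

```python
import math
from statistics import NormalDist

def delta_chi2_one_parameter(p):
    """Critical Delta chi^2 for m = 1: square of the two-sided Gaussian quantile."""
    z = NormalDist().inv_cdf(0.5 * (1.0 + p))
    return z * z

def delta_chi2_two_parameters(p):
    """Critical Delta chi^2 for m = 2, using P(chi^2(2) <= x) = 1 - exp(-x / 2)."""
    return -2.0 * math.log(1.0 - p)

for p in (0.68, 0.90, 0.99):
    print(p,
          round(delta_chi2_one_parameter(p), 2),
          round(delta_chi2_two_parameters(p), 2))
```

At p = 0.9 these expressions reproduce the values 2.7 and 4.6 quoted in the text, and at p = 0.68 they give approximately 1.0 and 2.3.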

The method to determine the confidence interval on the parameters starts with
the value of $\chi^2_{min}$. From this, one constructs an m-dimensional volume bounded by
the surface of $\Delta\chi^2 = \chi^2 - \chi^2_{min} = \chi^2_{crit}$, where $\chi^2_{crit}$ is the value that corresponds to
a given confidence level p for m degrees of freedom, as tabulated in Table A.7. The
surface of this m-dimensional volume marks the boundaries of the rejection region
at the p level (say p = 90 %) for the m parameters, i.e., the parameters can vary
within this volume and still remain an acceptable fit to the data at that confidence
level. In practice, a surface at fixed $\Delta\chi^2 = \chi^2_{crit}$ can be calculated by a grid
of points around the values that correspond to $\chi^2_{min}$. This calculation can become
computationally intensive as the number of parameters m increases. An alternative
method to estimate confidence intervals on fit parameters that makes use of Monte
Carlo Markov chains (see Chap. 16) will overcome this limitation.

Example 10.3 Consider the case of a linear fit to the data of Table 10.1. According
to the data in Table 10.1, one can calculate the best-fit estimates of the parameters
as a = 23.54 ± 4.25 and b = 13.48 ± 2.16, using (8.7) and (8.22). The best-fit
line is shown in Fig. 10.2. There is no guarantee that these values are in fact the true
values: they are only the best estimates based on the maximum likelihood method.
For these best-fit values of the coefficients, the fit statistic is $\chi^2_{min} = 0.53$, for f = 3
degrees of freedom, corresponding to a probability p = 0.09 (i.e., a probability
$P(\chi^2(3) \geq 0.53) = 0.91$). The fit cannot be rejected at any reasonable confidence
level, since the probability to exceed the measured $\chi^2_{min}$ is so high.

We now sample the parameter space, and determine variations in the fit statistic
$\chi^2$ around the minimum value. The result is shown in Fig. 10.3, in which the
contours mark the $\chi^2_{min} + 1.0$, $\chi^2_{min} + 2.3$ and $\chi^2_{min} + 4.6$ boundaries. In this
application, m = 2, a value of $\Delta\chi^2 = 4.6$ or larger is expected 10 % of the time.
Accordingly, the $\Delta\chi^2 = 4.6$ contour marks the 90 % confidence surface: the true
values of a and b are within this area 90 % of the time, if the null hypothesis that
the model is an accurate description of the data is correct. This area is therefore the
90 % confidence area for the two fitted parameters. ◊

Table 10.1 Data used to illustrate the linear regression, and the estimate of confidence intervals on fit parameters

Fig. 10.2 Best-fit linear model (dashed line) to the data of Table 10.1. The $\chi^2_{min} = 0.53$ indicates a very good fit which cannot be rejected at any reasonable confidence level
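The statement that the true parameters fall within the $\Delta\chi^2 = 4.6$ contour 90 % of the time can be tested with a simulation. The following is a minimal sketch, not the procedure used for Fig. 10.3: since the data of Table 10.1 are not reproduced here, it uses a hypothetical linear model and hypothetical error bars, and all names and numerical choices are assumptions of this example. For each simulated dataset it computes $\Delta\chi^2 = \chi^2_{true} - \chi^2_{min}$ and counts how often this is below the critical value.

```python
import random

def linear_fit(x, y, sigma):
    """Weighted least-squares fit of y = a + b x; returns the best-fit (a, b)."""
    w = [1.0 / s ** 2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = S * Sxx - Sx ** 2
    return (Sxx * Sy - Sx * Sxy) / det, (S * Sxy - Sx * Sy) / det

def chi2(a, b, x, y, sigma):
    """chi^2 of the linear model with parameters (a, b)."""
    return sum(((yi - a - b * xi) / si) ** 2 for xi, yi, si in zip(x, y, sigma))

def coverage(delta_crit=4.605, n_trials=5000, seed=2):
    """Fraction of simulated datasets whose true (a, b) lie inside the contour."""
    rng = random.Random(seed)
    x = [0.0, 1.0, 2.0, 3.0, 4.0]          # hypothetical design
    sigma = [1.0, 1.5, 1.0, 2.0, 1.5]      # hypothetical error bars
    a_true, b_true = 2.0, 0.5              # hypothetical true parameters
    inside = 0
    for _ in range(n_trials):
        y = [rng.gauss(a_true + b_true * xi, si) for xi, si in zip(x, sigma)]
        a_fit, b_fit = linear_fit(x, y, sigma)
        delta = chi2(a_true, b_true, x, y, sigma) - chi2(a_fit, b_fit, x, y, sigma)
        inside += delta <= delta_crit
    return inside / n_trials

print(f"fraction inside the Delta chi2 = 4.6 contour: {coverage():.3f}")
```

With m = 2 fitted parameters the fraction should be close to 0.90, and repeating the call with `delta_crit=2.279` should give a fraction close to 0.68.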


10.3.2 Confidence Intervals on Reduced Number of Parameters

In the case of a large number m of free parameters, it is customary to report the
uncertainty on each of the fitted parameters or, in general, on just a subset of l < m
parameters considered to be of interest. In this case, the l parameters $a_1, \ldots, a_l$ are
said to be the interesting parameters, and the remaining m - l parameters are said
to be uninteresting. This can be thought of as reducing the number of parameters
of the model from m to l, often in such a way that only one interesting parameter
is investigated at a time (l = 1). This is a situation that is of practical importance
for several reasons. First, it is not convenient to display surfaces in more than two or
three dimensions. Also, sometimes there are parameters that are truly uninteresting
to the interpretation of the data, although necessary for its analysis. One case of this
is the presence of a measurement background, which must be taken into account
for a proper analysis of the data, but is of no interest in the interpretation of the
results.

Fig. 10.3 Contours of $\Delta\chi^2$ = 1.0, 2.3 and 4.6 (from smaller to larger areas), plotted in the plane of the parameters a and b. For the example of m = 2 free parameters, the contours mark the area within which the true parameters a and b are expected to be with, respectively, 39, 68, and 90 % probability

New considerations must be applied to $\chi^2_{true}$ and its parent distribution in this
situation. We find $\chi^2_{min}$ in the usual way, that is, by fitting all parameters and
adjusting them until the minimum $\chi^2$ is found. Therefore $\chi^2_{min}$ continues to be
distributed like $\chi^2(N - m)$. For $\chi^2_{true}$, we want to ignore the presence of the
uninteresting parameters. We do so by assuming that the l interesting parameters are
fixed at the true values and marginalize over the m - l uninteresting parameters. This
process of marginalization means that we let the uninteresting parameters adjust
themselves to the values that yield the lowest value of $\chi^2$. This process ensures that
$\chi^2_{true} \sim \chi^2(N - (m - l))$. Notice that the marginalization does not mean fixing the
values of the uninteresting parameters to their best-fit values.

In summary, the change in $\chi^2$ that can be tolerated will therefore be
$\Delta\chi^2 = \chi^2_{true} - \chi^2_{min}$, in which $\chi^2_{true} \sim \chi^2(N - (m - l))$ and
$\chi^2_{min} \sim \chi^2(N - m)$. Since the two $\chi^2$ distributions are independent of one another,
it follows that

$$\Delta\chi^2 \sim \chi^2(l) \qquad (10.9)$$

where l is the number of interesting parameters. The process of finding confidence
intervals for a reduced number of parameters is illustrated in the following example
of a model with m = 2 free parameters, for which we also find confidence intervals
for one interesting parameter at a time.

Example 10.4 Consider the case in Fig. 10.3, and assume that the interesting
parameter is a. The $\chi^2_{true}$ for each value of a is obtained by searching for the minimum
$\chi^2$ along a vertical line (i.e., for a fixed value of a). The best-fit value of a is already
known, marked by a cross in Fig. 10.3. When seeking the 68 % confidence interval
for the interesting parameter a, the limiting values of a are those on either side
of the best-fit value that result in a minimum $\chi^2$ value of $\chi^2_{min} + 1.0$ (where $\chi^2_{min}$
is the global minimum). Therefore, the 68 % confidence interval for a is found by
projecting the $\chi^2_{min} + 1.0$ contour along the a axis. That is to say, we find the smallest
and largest values of a along the $\chi^2_{min} + 1.0$ contour, which is the innermost contour
in Fig. 10.3. Likewise the projection of the same contour along the b axis gives the
68 % confidence interval on b, when considered as the only interesting parameter.

On the other hand, the 2-dimensional 68 % confidence surface on a, b was given
by the $\chi^2_{min} + 2.3$ contour. It is important not to confuse those two confidence ranges,
both at the same level of confidence of 68 %. The reason for the difference ($\chi^2_{min} + 1.0$
for one interesting parameter vs. $\chi^2_{min} + 2.3$ for two interesting parameters) is the
number of degrees of freedom of the respective $\Delta\chi^2$. ◊
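The projection described in this example can be sketched for a linear model, where the minimization over the uninteresting parameter has a closed form. The code below is an illustration with hypothetical data (not those of Table 10.1) and hypothetical function names. It finds the value of a at which the $\chi^2$ minimized over b has increased by 1.0, and compares the resulting half-width with the analytic error on a from the linear regression; for contrast, it also computes the narrower interval obtained by incorrectly freezing b at its best-fit value.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical data, for illustration only
y = [2.1, 2.9, 4.2, 4.8, 6.1]
sigma = [0.2, 0.3, 0.2, 0.4, 0.3]

w = [1.0 / s ** 2 for s in sigma]
S = sum(w)
Sx = sum(wi * xi for wi, xi in zip(w, x))
Sy = sum(wi * yi for wi, yi in zip(w, y))
Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
det = S * Sxx - Sx ** 2
a_hat = (Sxx * Sy - Sx * Sxy) / det      # best-fit intercept
b_hat = (S * Sxy - Sx * Sy) / det        # best-fit slope
sigma_a = math.sqrt(Sxx / det)           # analytic error on the intercept

def chi2(a, b):
    """chi^2 of the linear model y = a + b x."""
    return sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))

def profile_chi2(a):
    """chi^2 minimized over the uninteresting slope b, with a held fixed."""
    b = (Sxy - a * Sx) / Sxx
    return chi2(a, b)

def upper_limit(stat, delta=1.0):
    """Value of a above a_hat at which stat(a) - chi2_min = delta, by bisection."""
    chi2_min = chi2(a_hat, b_hat)
    lo, hi = a_hat, a_hat + 10.0 * sigma_a
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if stat(mid) - chi2_min < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

half_width_profile = upper_limit(profile_chi2) - a_hat
half_width_fixed_b = upper_limit(lambda a: chi2(a, b_hat)) - a_hat
print(f"analytic sigma_a               : {sigma_a:.4f}")
print(f"Delta chi2 = 1 with b re-fitted: {half_width_profile:.4f}")
print(f"Delta chi2 = 1 with b frozen   : {half_width_fixed_b:.4f}")
```

For a linear model the $\chi^2$ surface is exactly quadratic, so the re-fitted (projected) half-width should agree with the analytic error on a, while freezing b gives an interval that is too small.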

This  procedure  for  estimation  of  intervals  on  a  reduced  number  of  parameters 
was  not  well  understood  until  the  work  of  Lampton  and  colleagues  in  1976  [27].  It 
is  now  widely  accepted  as  the  correct  method  to  estimate  errors  in  a  subset  of  the 
model  parameters. 


10.4  Confidence  Intervals  of  Parameters  for  Poisson  Data 

The fit to Poisson data was described in Sect. 8.8. Since the Cash statistic C follows
approximately the $\chi^2$ distribution in the limit of a large number of datapoints, then
the statistic

$$\Delta C = C_{true} - C_{min} \qquad (10.10)$$

has the same statistical properties as the $\Delta\chi^2$ distribution. Parameter estimation with
the Poisson statistic therefore follows the same rules and procedures as with the $\chi^2$
statistic.

Example 10.5 Consider an ideal dataset of N = 10 identical measurements $y_i = 1$. We
want to fit the data to a constant model $y = a$, and construct a 1-$\sigma$ confidence
interval on the fit parameter a using both the Poisson fit statistic C, and the Gaussian
fit statistic $\chi^2$. In the case of the Poisson statistic, we assume that the measurements
are derived from a counting experiment, that is, a count of 1 was recorded in each
case. In the case of Gaussian variables, we assume uncertainties of $\sigma_i = 1$.

In the case of Poisson data, we use the Cash statistic defined in (10.4). The best-fit
value of the model is clearly a = 1, and we want to find the value $\delta$ corresponding
to a change in C by a value of 1 with respect to the minimum value $C_{min}$,

$$\Delta C = 1 \Rightarrow -2 \sum_{i=1}^{N} 1 \cdot \ln \frac{1 + \delta}{1} + 2 \sum_{i=1}^{N} (1 + \delta - 1) = 1.$$

Using the approximation $\ln(1 + \delta) \simeq \delta - \delta^2/2$, we find that

$$-20\delta + 10\delta^2 + 20\delta = 1 \Rightarrow \delta = \sqrt{1/10}.$$

This shows that the 68 % confidence range is between $1 - \sqrt{1/10}$ and $1 + \sqrt{1/10}$,
or $1 \pm \sqrt{1/10}$.

Using Gaussian errors, we calculate $\Delta\chi^2 = 1$, leading to $10\delta^2 = 1$, and the same
result as in the case of the Poisson dataset. ◊
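The role of the approximation $\ln(1 + \delta) \simeq \delta - \delta^2/2$ can be seen by solving $\Delta C = 1$ numerically. The sketch below is illustrative, with hypothetical function names and search brackets; it assumes N = 10 measurements, consistent with the coefficients of the equation above, and solves both $\Delta C = 1$ and $\Delta\chi^2 = 1$ by bisection on either side of the best-fit value a = 1.

```python
import math

N = 10  # number of identical measurements y_i = 1

def delta_cash(a):
    """Delta C for the constant model y = a; C_min = 0 at a = 1."""
    return -2.0 * N * math.log(a) + 2.0 * N * (a - 1.0)

def delta_chi2(a):
    """Delta chi^2 for the same data with Gaussian errors sigma_i = 1."""
    return N * (a - 1.0) ** 2

def solve(stat, lo, hi, target=1.0):
    """Bisection for stat(a) = target on an interval where stat is monotonic."""
    increasing = stat(hi) > stat(lo)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (stat(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

approx = math.sqrt(1.0 / N)
chi2_interval = (solve(delta_chi2, 0.01, 1.0), solve(delta_chi2, 1.0, 3.0))
cash_interval = (solve(delta_cash, 0.01, 1.0), solve(delta_cash, 1.0, 3.0))
print("quadratic approximation:", 1.0 - approx, 1.0 + approx)
print("Gaussian chi2 interval :", chi2_interval)
print("exact Cash interval    :", cash_interval)
```

The Gaussian interval coincides with $1 \pm \sqrt{1/10}$, while the exact Cash interval is slightly asymmetric around a = 1; the difference comes from the higher-order terms of the logarithm, which are not negligible for such a small number of counts.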


10.5  The  Linear  Correlation  Coefficient 


We want to define a quantity that describes whether there is a linear relationship
between two random variables X and Y. This quantity is based on the slopes of two
linear fits of X and Y, using each in turn as the independent variable. Call b the slope
of the regression $y = a + bx$ (where X is the independent variable) and $b'$ the slope
of the regression $x = a' + b'y$ (where Y is the independent variable) and assume that
there are N measurements of the two variables. The linear correlation coefficient r
is defined through the product of the slopes of the two fits via

$$r^2 = b b' = \frac{\left(N \sum x_i y_i - \sum x_i \sum y_i\right)^2}{\left(N \sum x_i^2 - \left(\sum x_i\right)^2\right)\left(N \sum y_i^2 - \left(\sum y_i\right)^2\right)} \qquad (10.11)$$

in which we have used the results of (8.23). It is easy to show that this expression
can be rewritten as

$$r^2 = \frac{\left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2} \qquad (10.12)$$

and therefore r is the sample correlation coefficient as defined in (2.20).
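The equivalence between the product of the two slopes and the squared sample correlation coefficient can be checked numerically. The following is a minimal sketch with hypothetical measurements and function names, using the unweighted least-squares slope for each regression.

```python
import math

def slope(u, v):
    """Least-squares slope of the regression of v on u (equal errors on v)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    suu = sum(ui * ui for ui in u)
    return (n * suv - su * sv) / (n * suu - su ** 2)

def correlation(x, y):
    """Sample linear correlation coefficient, whose square is (10.12)."""
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    sxx = sum((xi - xm) ** 2 for xi in x)
    syy = sum((yi - ym) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # hypothetical measurements
y = [1.2, 2.3, 2.9, 4.4, 4.8, 6.5]
b = slope(x, y)                          # regression of Y on X
b_prime = slope(y, x)                    # regression of X on Y
r = correlation(x, y)
print(f"b = {b:.4f}, b' = {b_prime:.4f}")
print(f"b * b' = {b * b_prime:.6f}, r^2 = {r ** 2:.6f}")
```

The two printed quantities on the last line agree to rounding error, as (10.11) and (10.12) require.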

Consider as an example the data from Pearson’s experiment on page 30. The
measurements of the mother’s and father’s heights are likely to have the same uncertainty,
since one expects that both women and men followed a similar procedure for the
measurement. Therefore no precedence should be given to either when assigning
the tag of “independent” variable. Instead, one can proceed with two separate fits:
one in which the father’s height (X) is considered as the independent variable, or the
regression of Y on X (dashed line), and the other in which the mother’s height (Y) is
the independent variable, or linear regression of X on Y (dot-dash line). The two fits
are reported in Fig. 10.4, obtained by the maximum likelihood method assuming equal
errors for the dependent variables.

Fig. 10.4 Linear regressions based on the data collected by Pearson, Table 2.3 at page 30. Larger circles indicate a higher number of occurrences for that bin. Horizontal axis: father height

If the two variables X and Y are uncorrelated, then the two best-fit slopes b and
$b'$ are expected to be zero. In fact, as one variable varies through its range, the
other is not expected to either decrease (negative correlation) or increase (positive
correlation), resulting in null best-fit slopes for the two fits. We therefore expect
the sample distribution of r to have zero mean, under the null hypothesis of lack
of correlation between X and Y. If there is a true linear correlation between the
two variables, i.e., $y = a + bx$ is satisfied with $b \neq 0$, then it is also true that
$x = a' + b'y = -a/b + y/b$. In this case one therefore expects $bb' = r^2 = 1$.


10.5.1  The  Probability  Distribution  Function 


A quantitative test for the correlation between two random variables requires the
distribution function $f_r(r)$. We show that the probability distribution of r, under the
hypothesis that the two variables X and Y are uncorrelated, is given by

$$f_r(r) = \frac{1}{\sqrt{\pi}} \frac{\Gamma\left(\frac{f+1}{2}\right)}{\Gamma\left(\frac{f}{2}\right)} \left(1 - r^2\right)^{\frac{f-2}{2}} \qquad (10.13)$$

where f = N - 2 is the effective number of degrees of freedom of a dataset with
N measurements of the pairs of variables. The form of the distribution function is
reminiscent of the t distribution, which in fact plays a role in the determination of
this distribution.

Proof The proof starts with the determination of the probability distribution
function of a suitable function of r, and then, by change of variables, the
distribution of r is obtained.

The best-fit parameter b is given by

$$b^2 = \frac{\left(N \sum x_i y_i - \sum x_i \sum y_i\right)^2}{\left(N \sum x_i^2 - \left(\sum x_i\right)^2\right)^2} = \frac{\left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\left(\sum (x_i - \bar{x})^2\right)^2},$$

and accordingly we obtain

$$r^2 = \frac{\left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2} = b^2 \frac{\sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2}. \qquad (10.14)$$

Also, using (8.5), the best-fit parameter a can be shown to be equal to
$a = \bar{y} - b\bar{x}$, and therefore we obtain

$$S^2 \equiv \sum (y_i - a - b x_i)^2 = \sum (y_i - \bar{y})^2 - b^2 \sum (x_i - \bar{x})^2. \qquad (10.15)$$

Notice that $S^2/\sigma^2 = \chi^2_{min}$, where $\sigma^2$ is the common variance of the Y
measurements, and therefore using (10.14) and (10.15) it follows that

$$\frac{S^2}{\sum (y_i - \bar{y})^2} = 1 - r^2$$

or, alternatively,

$$\frac{r}{\sqrt{1 - r^2}} = \frac{b \sqrt{\sum (x_i - \bar{x})^2}}{S}. \qquad (10.16)$$

Equation (10.16) provides the means to determine the distribution function of
$r/\sqrt{1 - r^2}$. First, notice that the variance of b is given by

$$\sigma_b^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}.$$

According to (8.35), $s^2 = S^2/(N - 2)$ is the unbiased estimator of the variance
$\sigma^2$, since two parameters have been fit to the data. Assuming that the true
parameter for the slope of the distribution is $\beta$, then

$$\frac{b - \beta}{\sigma_b} = \frac{b - \beta}{\sigma / \sqrt{\sum (x_i - \bar{x})^2}} \sim N(0, 1)$$

is therefore distributed like a standard Gaussian. In the earlier equation, if we
replace $\sigma^2$ with the sample variance $s^2 = S^2/(N - 2)$, and enforce the null
hypothesis that the variables X and Y are uncorrelated ($\beta = 0$), we obtain
a new variable that is distributed like a t distribution with N - 2 degrees of
freedom,

$$\frac{b}{S} \sqrt{N - 2} \sqrt{\sum (x_i - \bar{x})^2} \sim t(N - 2).$$

Using (10.16), we find that the variable

$$\frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}} \qquad (10.17)$$

is distributed like a t distribution with f = N - 2 degrees of freedom and, since
it is a monotonic function of r, its distribution can be related to the distribution
$f_r(r)$ via a simple change of variables, following the method described in
Sect. 4.4.1.

Starting with $v = r\sqrt{N - 2}/\sqrt{1 - r^2}$, and

$$f_T(v) = \frac{1}{\sqrt{\pi f}} \frac{\Gamma\left(\frac{f+1}{2}\right)}{\Gamma\left(\frac{f}{2}\right)} \left(1 + \frac{v^2}{f}\right)^{-\frac{f+1}{2}}$$

with

$$\frac{dv}{dr} = \frac{\sqrt{N - 2}}{(1 - r^2)^{3/2}},$$

the equation of change of variables $f_r(r) = f_T(v)\, dv/dr$ yields (10.13) after a
few steps of algebra. □


10.5.2  Hypothesis  Testing 

A test for the presence of a linear relationship between two variables makes use of the
distribution function of r derived in the previous section. In the absence of a linear
relationship, we expect a value of r close to 0, while values close to the extremes of
±1 indicate a strong correlation between the two variables. Since the null hypothesis
is that there is no correlation, we use a two-tailed test to define the critical value of
r via

$$P(|r| > r_{crit}) = 1 - \int_{-r_{crit}}^{r_{crit}} f_r(r')\, dr' = 1 - p, \qquad (10.18)$$


where p is intended, as usual, as a number close to 1 (e.g., p = 0.9 or 90 % confidence).
Critical values of r for various probability levels are listed in Table A.24.

If  the  measured  value  of  r  exceeds  the  critical  value,  the  null  hypothesis  must  be 
discarded.  This  is  an  indication  that  there  is  a  linear  relationship  between  the  two 
quantities  and  further  modelling  of  Y  vs.  X  or  X  vs.  Y  is  warranted.  In  practice, 
the  linear  correlation  coefficient  test  should  be  performed  prior  to  attempting  any 
regression  between  the  two  variables. 

Example  10.6  The  two  fits  to  the  data  from  Pearson’s  experiment  (page  30)  are 
illustrated  in  Fig.  10.4.  A  linear  regression  provides  a  best-fit  slope  of  b  —  0.25 
(dashed  line)  and  of  h'  —  0.33  (dot-dash  line),  respectively,  when  using  the  father’s 
stature  (x  axis)  or  the  mother’s  stature  as  the  independent  variable.  For  these  fits 
we  use  the  data  provided  in  Table  2.3.  Each  combination  of  father-mother  heights 
is  counted  a  number  of  times  equal  to  its  frequency  of  occurrence,  for  a  total  of 
N  =  1, 079  datapoints. 

The linear correlation coefficient for these data is r = 0.29, which is also equal
to $\sqrt{bb'}$. For N = 1,079 datapoints, Table A.24 indicates that the hypothesis of
no correlation between the two quantities must be discarded at > 99 % confidence,
since the critical value at 99 % confidence is approximately 0.081, and our measurement exceeds
it. As a result, we conclude that the two quantities are likely to be truly correlated.
The origin of the correlation is probably the fact that people have a preference
to marry a person of similar height, or more precisely, a person of a height that is
linearly proportional to their own. ◇
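As a rough numerical check of Example 10.6, the following Python sketch recomputes r from the two slopes and compares it with an approximate critical value. The critical value here is obtained from a large-N normal approximation, which is an assumption introduced for illustration only; the text reads the critical value from Table A.24 instead. All variable names are illustrative.

```python
import math
from statistics import NormalDist

# Slopes quoted in Example 10.6 for the two regressions of Pearson's data.
b = 0.25         # slope with the father's stature as independent variable
b_prime = 0.33   # slope with the mother's stature as independent variable
n_points = 1079

# The linear correlation coefficient is the geometric mean of the two slopes.
r = math.sqrt(b * b_prime)

# ASSUMPTION (not from the text): for large N and no correlation,
# r * sqrt((N - 2) / (1 - r^2)) is treated as a standard normal variable.
# Solving for r gives r_crit = z / sqrt(N - 2 + z^2).
p = 0.99
z = NormalDist().inv_cdf(0.5 + p / 2.0)   # two-tailed quantile
r_crit = z / math.sqrt(n_points - 2 + z * z)

print(f"r = {r:.2f}, approximate critical value = {r_crit:.3f}")
print("reject the hypothesis of no correlation:", abs(r) > r_crit)
```

Under this approximation the critical value comes out close to the tabulated one, and the measured r exceeds it by a wide margin, in agreement with the conclusion of the example.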


Summary  of  Key  Concepts  for  this  Chapter 

□ The $\chi^2_{\min}$ statistic: It applies to Gaussian data and it is distributed like a
$\chi^2$ distribution with $N - m$ degrees of freedom.

□ The Cash statistic: It applies to Poisson data and it is defined as

$$C = -2\sum y_i \ln\frac{y(x_i)}{y_i} + 2\sum \left(y(x_i) - y_i\right).$$

It is approximately distributed like $\chi^2_{\min}$.

□ Confidence intervals for $\chi^2_{\min}$ statistic: They are obtained from the condition
that $\Delta\chi^2 \sim \chi^2(m)$, where m is the number of parameters of interest.

□  Interesting  parameters :  A  subset  of  all  model  parameters  for  which  we  are 
interested  in  calculating  confidence  intervals. 

□ Linear correlation coefficient: The quantity $-1 \le r \le 1$ that determines
whether there is a linear correlation between two variables.
Problems

10.1  Use  the  same  data  as  in  Problem  8.2  to  answer  the  following  questions. 

(a) Plot the 2-dimensional confidence contours at 68 and 90 % confidence, by
sampling the (a, b) parameter space in a suitable interval around the best-fit
values.

(b) Using a suitable 2-dimensional confidence contour, determine the 68 % confidence
intervals on each parameter separately, and compare with the analytic
results obtained from the linear regression method.

10.2 Find the minimum $\chi^2$ of the linear fit to the radius vs. ratio data of Table 6.1
and  the  number  of  degrees  of  freedom  of  the  fit.  Determine  if  the  null  hypothesis 
can  be  rejected  at  the  99  %  confidence  level. 

10.3 Consider a simple dataset with the following measurements, assumed to be
derived from a counting process. Show that the best-fit value of the parameter a for
the model $y = e^{ax}$ is a = 0 and derive its 68 % confidence interval.

x   y
0   1
1   1
2   1

10.4 Consider the same dataset as in Problem 10.3 but assume that the y measurements
are Gaussian, with variances equal to the measurements. Show that the
confidence interval of the best-fit parameter a = 0 is given by $\sigma_a = \sqrt{1/5}$.

10.5 Consider the same dataset as in Problem 10.3 but assume a constant fit
function, y = a. Show that the best-fit is given by a = 1 and that the 68 %
confidence interval corresponds to a standard deviation of $\sqrt{1/3}$.

10.6  Consider  the  biometric  data  in  Pearson’s  experiment  (page  30).  Calculate  the 
average  father  height  (X  variable)  for  each  value  of  the  mother’s  height  (Y  variable), 
and  the  average  mother  height  for  each  value  of  the  father’s  height.  Using  these  two 
averaged  datasets,  perform  a  linear  regression  of  Y  on  X,  where  Y  is  the  average 
value  you  have  calculated,  and,  similarly,  the  linear  regression  of  X  on  Y.  Calculate 
the best-fit parameters a, b (regression of Y on X) and a', b' (regression of X on
Y),  assuming  that  each  datapoint  in  your  two  sets  has  the  same  uncertainty.  This 
problem  is  an  alternative  method  to  perform  the  linear  regressions  of  Fig.  10.4,  and 
it  yields  similar  results  to  the  case  of  a  fit  to  the  “raw”  data,  i.e.,  without  averaging. 

10.7 Calculate the linear correlation coefficient for the data of Hubble’s experiment
(logarithm of velocity, and magnitude m), page 157. Determine whether the
hypothesis of no correlation between the two quantities can be rejected at the 99 %
confidence level.

10.8 Use the data from Table 6.1 for the radius vs. ratio, assuming that the radius is
the independent variable with no error. Draw the 68 and 90 % confidence contours
on the two fit parameters a and b, and calculate the 68 % confidence interval on the
b parameter.


Chapter  11 

Systematic  Errors  and  Intrinsic  Scatter 


Abstract Certain types of uncertainty are difficult to estimate and may not be
accounted for in the initial error budget. This sometimes leads to a poor goodness-of-fit
statistic  and  the  rejection  of  the  model  used  to  fit  the  data.  These  missing  sources 
of  uncertainty  may  either  be  associated  with  the  data  themselves  or  with  the  model 
used  to  describe  the  data.  In  both  cases,  we  describe  methods  to  account  for  these 
errors  and  ensure  that  hypothesis  testing  is  not  biased  by  them. 


11.1  What  to  Do  When  the  Goodness-of-Fit  Test  Fails 

The  first  step  to  ensure  that  a  dataset  is  accurately  described  by  a  model  is  to  test  that 
the goodness-of-fit statistic is acceptable. For example, when the data have Gaussian
errors, $\chi^2_{\min}$ can be used as the goodness-of-fit statistic. If the value of $\chi^2_{\min}$ exceeds
a critical value, it is recommended that one rejects the model. At that point, the
standard  option  is  to  use  an  alternative  model,  and  repeat  the  testing  procedure. 

There  are  cases  when  it  is  reasonable  to  try  a  bit  harder  and  investigate  further 
whether  the  model  and  the  dataset  may  still  be  compatible,  despite  the  poor 
goodness  of  fit.  The  general  situation  when  additional  effort  is  warranted  is  in  the 
case of a model that generally follows the data without severe outliers, yet the best-fit
statistic (such as $\chi^2_{\min}$) indicates that the model is not acceptable. An example of
this situation is that of Fig. 10.1: the best-fit linear model follows the distribution
of the data without systematic deviations, yet its high value of $\chi^2_{\min} = 60.5$ for 23
degrees of freedom cannot be formally accepted at any level of confidence.

In  this  chapter  we  describe  two  types  of  analysis  that  can  be  performed  when 
the  fit  of  a  dataset  to  a  model  is  poor.  The  first  method  assumes  that  the  model 
itself  has  a  degree  of  uncertainty  that  results  in  an  intrinsic  scatter  above  and 
beyond  the  variance  of  the  data  (Sect.  11.2).  The  second  investigates  whether 
there are additional sources of error in the data that may not have been properly
accounted for (Sect. 11.3). The two methods are conceptually different but result in
similar  modifications  to  the  analysis. 
11.2 Intrinsic Scatter and Debiased Variance

When  fitting  a  dataset  to  a  model  we  assume  that  the  data  are  drawn  from  a  parent 
model  that  is  described  by  a  number  of  parameters.  As  such,  we  surmise  that  there 
are  exact  model  parameters  that  describe  the  parent  distribution  of  the  data,  although 
we  don’t  know  their  precise  values.  We  use  the  data  to  estimate  them,  typically 
through  a  maximum  likelihood  method  that  consists  of  finding  model  parameters 
that  maximize  the  likelihood  of  the  data  being  drawn  from  that  model  (Chap.  8).  For 
Gaussian  data,  the  maximum  likelihood  method  consists  of  finding  the  minimum  of 
the $\chi^2$ statistic.

A possible reason for a poor value of the minimum $\chi^2$ statistic is that the model
itself, although generally accurate, may have an intrinsic scatter or variance that
needs to be accounted for in the determination of the fit statistic. In other words, the
parent  model  may  not  be  exact  but  it  may  feature  an  inherent  degree  of  variability. 
The  goal  of  this  section  is  to  provide  a  method  to  describe  and  measure  such  scatter. 


11.2.1  Direct  Calculation  of  the  Intrinsic  Scatter 

Each  measurement  in  a  dataset  can  be  described  as  the  sum  of  two  variables, 

$$y_i = \eta_i + \epsilon_i, \tag{11.1}$$

where $\eta_i$ represents the parent value from which the measurement $y_i$ is drawn and
$\epsilon_i$ is the variable representing the measurement error. Usually, we assume that $\eta_i =
y(x_i)$ is a fixed number, estimated by the least-squares (or other) method. Since $\epsilon_i$
is a variable of zero mean, and its variance is simply the measurement variance $\sigma_i^2$,
(11.1) implies that the variance of the measurement $y_i$ is just $\sigma_i^2$.

The model $\eta_i$ may, however, be considered a variable with non-zero variance.
This is to describe the fact that the model is not known exactly, but has an intrinsic
degree of variability measured by its variance $\sigma^2_{\rm int} = {\rm Var}(\eta_i)$. For simplicity, we
assume that this model variance is constant for all points along the model. Under the
assumption that the measurement error and the model are independent, variances of
the variables on the right-hand side of (11.1) add and this yields

$$\sigma^2_{\rm int} = {\rm Var}(y_i) - \sigma_i^2. \tag{11.2}$$

The equation means that the intrinsic variance is obtained as the difference of the
data variance minus the variance due to measurement errors. In keeping with the
definitions of (11.1), ${\rm Var}(y_i)$ refers to the total variance of the i-th variable at location
$x_i$. It is meaningful to calculate the average variance for all the $y_i$'s assuming that
each measurement is drawn from a parent mean of $\hat{y}_i$, the best-fit value of the model
$y(x_i)$. In so doing, we make use of the fact that the model is not constant but it varies
at different positions. As a result, (11.2) can be used to calculate the intrinsic scatter
or variance of the model $\sigma^2_{\rm int}$ as

$$\sigma^2_{\rm int} = \frac{1}{N-m}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 - \frac{1}{N}\sum_{i=1}^{N}\sigma_i^2, \tag{11.3}$$


where m is the number of model parameters. The intrinsic variance can also be
referred to as the debiased variance, because of the subtraction of the expected
scatter (due to measurement errors) from the total sample variance. Equation (11.3)
can be considered a generalization of (2.11) in two ways. First, the presence of errors
in the measurements of $y_i$ leads to the addition of the last term on the right-hand side.
Second, the total variance of the data is calculated not relative to the data mean $\bar{y}$
but to the parent mean of each measurement. It is possible that the second term in
the right-hand side of (11.3) is larger than the first term, leading to a negative value
for the intrinsic variance. This is an indication that, within the statistical errors $\sigma_i$,
there is no evidence for an intrinsic scatter of the model. This method to estimate
the intrinsic scatter is derived from [2] and [24].

It  is  important  to  remember  that  in  calculating  the  intrinsic  scatter  we  have  made 
the  assumption  that  the  model  is  an  accurate  representation  of  the  data.  This  means 
that  we  can  no  longer  test  for  the  null  hypothesis  that  the  model  represents  the  parent 
distribution — we  have  already  assumed  this  to  be  the  case. 
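The direct calculation of the debiased variance can be sketched in a few lines of Python. The function below evaluates the difference between the total variance about the best-fit model, with an N - m normalization, and the average measurement variance, following the description of (11.3). The function name and the dataset are hypothetical and serve only as an illustration.

```python
import math

def intrinsic_variance(y, y_model, sigma, n_params):
    """Debiased variance: scatter about the model minus the mean measurement variance."""
    n = len(y)
    if n <= n_params:
        raise ValueError("need more datapoints than model parameters")
    # Total variance of the data about the best-fit model values.
    total = sum((yi - mi) ** 2 for yi, mi in zip(y, y_model)) / (n - n_params)
    # Average variance expected from the measurement errors alone.
    measurement = sum(s ** 2 for s in sigma) / n
    return total - measurement

# Hypothetical data and best-fit values of a two-parameter (linear) model.
y = [1.2, 1.9, 3.4, 3.8, 5.3]
y_model = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = [0.1, 0.1, 0.1, 0.1, 0.1]

var_int = intrinsic_variance(y, y_model, sigma, n_params=2)
# A negative value means there is no evidence for intrinsic scatter.
sigma_int = math.sqrt(var_int) if var_int > 0 else 0.0
print(f"intrinsic variance = {var_int:.4f}, intrinsic scatter = {sigma_int:.3f}")
```

The guard on a negative variance mirrors the remark above that a negative result indicates no evidence for intrinsic scatter within the statistical errors.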

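When the model is constant, with $\hat{y}_i = \bar{y}$ being the sample mean, the intrinsic
scatter is calculated as

$$\sigma^2_{\rm int} = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y})^2 - \frac{1}{N}\sum_{i=1}^{N}\sigma_i^2. \tag{11.4}$$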


In  this  case,  (11.4)  is  an  unbiased  estimate  of  the  variance  of  Y. 


11.2.2  Alternative  Method  to  Estimate  the  Intrinsic  Scatter 

An alternative method to measure the amount of extra variance in a fit makes use of
the fact that, for a Gaussian dataset, the expected value of the reduced $\chi^2_{\min}$ is one.
A large value of the minimum $\chi^2$ can be reduced by increasing the size of the errors
until $\chi^2_{\rm red} \simeq 1$, or

$$\chi^2_{\rm red} = \frac{1}{N-m}\sum_{i=1}^{N}\frac{(y_i - \hat{y}_i)^2}{\sigma_i^2 + \sigma^2_{\rm int}} \simeq 1, \tag{11.5}$$


where m is the number of free model parameters, and $\sigma_{\rm int}$ is the intrinsic scatter
that makes the reduced $\chi^2$ unity. In (11.5) we have made the following substitution
relative to the standard use of the $\chi^2_{\min}$ method:

$$\sigma_i^2 \rightarrow \sigma_i^2 + \sigma^2_{\rm int}. \tag{11.6}$$

This method is only approximate, in that an acceptable model need not yield exactly
a value of $\chi^2_{\rm red} = 1$. This method to estimate the intrinsic scatter is nonetheless
useful as an estimate of the level of scatter present in the data. As in the earlier
method, the analyst is making the assumption that the model fits the data and that
the extra variance is attributed to an intrinsic variability of the model ($\sigma^2_{\rm int}$).
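A minimal sketch of this alternative method is given below. It assumes that the best-fit model values are held fixed while a common intrinsic scatter is added in quadrature to every error, and it finds by bisection the value that brings the reduced statistic to one. The function names, the bisection strategy, and the data are all hypothetical choices made for illustration.

```python
def reduced_chi2(y, y_model, sigma, sigma_int, n_params):
    """Reduced chi-square with sigma_int added in quadrature to every error."""
    dof = len(y) - n_params
    return sum((yi - mi) ** 2 / (s ** 2 + sigma_int ** 2)
               for yi, mi, s in zip(y, y_model, sigma)) / dof

def solve_sigma_int(y, y_model, sigma, n_params, tol=1e-10):
    """Bisection for the sigma_int that makes the reduced chi-square equal to 1."""
    if reduced_chi2(y, y_model, sigma, 0.0, n_params) <= 1.0:
        return 0.0   # the fit is already acceptable, no extra scatter needed
    lo, hi = 0.0, 1.0
    while reduced_chi2(y, y_model, sigma, hi, n_params) > 1.0:
        hi *= 2.0    # expand the bracket until the statistic drops below 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reduced_chi2(y, y_model, sigma, mid, n_params) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data and best-fit values of a two-parameter (linear) model.
y = [1.2, 1.9, 3.4, 3.8, 5.3]
y_model = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = [0.1, 0.2, 0.1, 0.2, 0.1]

sigma_int = solve_sigma_int(y, y_model, sigma, n_params=2)
print(f"sigma_int = {sigma_int:.3f}")
```

Bisection works here because the reduced statistic decreases monotonically as the added scatter grows, so the bracket always contains a single solution.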

Example 11.1 The example shown in Fig. 10.1 illustrates a case in which the data
do not show systematic deviations from a best-fit model, and yet the $\chi^2$ test would
require a rejection of the model. The quantities Energy 1 (independent variable) and
Energy 2 were fit to a linear model; the best-fit linear model yielded a fit statistic of
$\chi^2_{\min} = 60.5$ for 23 degrees of freedom and the model was therefore not acceptable.

Making use of the methods developed in this section, we can estimate the
intrinsic scatter that makes the model consistent with the data. Using (11.3), the
intrinsic scatter is estimated to be $\sigma_{\rm int} = 2.5$. This means that the model has a
typical uniform variability of 2.5 units (the units are those of the y axis, in this case
used to measure energy). Using (11.5), a value of $\sigma_{\rm int} = 1.6$ is needed to obtain
a reduced $\chi^2_{\min}$ of unity. The two methods were not expected to provide the same
answer since they are based on different assumptions. ◇


11.3  Systematic  Errors 

The  errors  described  so  far  in  this  book  are  usually  referred  to  as  random  errors , 
since  they  describe  the  uncertainties  in  the  random  variables  of  interest.  There  are 
many sources of random error. A common source of random error is the Poisson or
counting error which derives from measuring N counts in an experiment and results
in an error of $\sqrt{N}$. Another source of error is due to the presence of a background
that needs to be subtracted from the measured signal. In general, any instrument used
to record data will have sources of error that cause the measurements to fluctuate
randomly around their mean value.

One  of  the  main  tasks  of  a  data  analyst  is  to  find  all  the  important  sources  of  error 
that  contribute  to  the  variance  of  the  random  variable  of  interest.  A  typical  case  is 
the measurement of a total signal T in the presence of a background B, where the
random variable of interest is the background-subtracted signal S,

$$S = T - B. \tag{11.7}$$

If the background is measured independently from the signal T, then the variance of
the source is

$$\sigma_S^2 = \sigma_T^2 + \sigma_B^2. \tag{11.8}$$
The lesson to learn is that the variance of the random variable of interest S increases
when the background is subtracted. If one assumes that there is no background, or
that the background is constant ($\sigma_B^2 = 0$), the random error associated with S may
be erroneously underestimated.
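A short numerical illustration of this point follows. It assumes that both the total signal and the background are Poisson counts measured independently, so that each variance equals the number of counts; the count values are hypothetical.

```python
import math

# Hypothetical counts, each assumed to follow a Poisson distribution.
total_counts = 100        # T, total signal
background_counts = 36    # B, background measured independently

signal = total_counts - background_counts   # S = T - B

# Poisson variances equal the counts; independent variances add.
var_total = total_counts
var_background = background_counts
sigma_signal = math.sqrt(var_total + var_background)

# Error obtained if the background is wrongly treated as exactly known.
sigma_naive = math.sqrt(var_total)

print(f"S = {signal}, sigma_S = {sigma_signal:.2f}, naive sigma = {sigma_naive:.2f}")
```

Ignoring the background variance gives the smaller naive error, which is the underestimate the text warns against.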

The  term  statistical  error  is  often  used  as  a  synonym  of  random  error.  Sometimes, 
however,  it  is  used  to  designate  the  leading  source  of  random  error,  such  as  the 
Poisson  uncertainty  in  a  counting  experiment,  not  including  other  sources  of  random 
error  that  are  equally  statistical  or  random  in  nature.  Such  use  is  not  accurate,  but  the 
reader  should  be  aware  that  there  is  no  universally  accepted  meaning  for  the  term 
“statistical  error.” 

The  term  systematic  error  designates  sources  of  error  that  systematically  shift 
the  signal  of  interest  either  too  high  or  too  low.  Sources  of  systematic  errors  need  to 
be  identified  to  correct  the  erroneous  offset.  A  typical  example  is  an  instrument  that 
is  miscalibrated  and  systematically  reports  measurements  that  have  an  erroneous 
offset.  Even  after  the  correction  for  the  offset,  it  is  however  quite  likely  that  there  still 
remains  a  source  of  error,  for  example  associated  with  the  fact  that  such  correction 
may  not  be  uniform  for  all  datapoints.  If  the  systematic  error  is  additive  in  nature, 
i.e., it shifts the random variable X according to $X' = X \pm E$, then the variance of
the data is to be modified according to

$$\sigma_i'^2 = \sigma_i^2 + \sigma_E^2. \tag{11.9}$$

The term $\sigma_E^2$ denotes the variance of the systematic error E. If E were known exactly,
it would ideally have zero variance. But in all practical cases, there will be an
additional source of variance from the correction of a systematic error that needs
to be accounted for. The modification of the error due to the presence of a source
of systematic error is therefore identical in form to the presence of intrinsic error
[compare (11.6) and (11.9)].

If the systematic error is multiplicative in nature, i.e., $X' = E \cdot X$, it may be
convenient to use the logarithms, $\log X' = \log X + \log E$, and then proceed as in the
case of a linear offset.
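Written out, and assuming that E and X are independent (an assumption added here for the illustration), the logarithmic form leads to the same addition of variances as in the additive case, now applied to the logarithms:

```latex
\log X' = \log X + \log E
\quad\Longrightarrow\quad
\sigma^2_{\log X'} = \sigma^2_{\log X} + \sigma^2_{\log E}
```

The symbols $\sigma^2_{\log X}$ and $\sigma^2_{\log E}$ are notation introduced for this sketch and denote the variances of the logarithms of the two quantities.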

Example 11.2 Continuing with the example shown in Fig. 10.1, we can use the
results provided in Example 11.1 to say that an additional error of $\sigma_E = 1.6$ would
yield a fit statistic of $\chi^2_{\rm min,red} = 1$. This means that a possible interpretation for
the large value of $\chi^2_{\min}$ is that we had neglected an additional source of error $\sigma_E$.
This additional source of error would be in place of the intrinsic scatter, since either
correction to the calculation of $\chi^2_{\min}$ is sufficient to bring the data in agreement with
the model.

The errors of the data in Fig. 10.1 accounted for several sources of random
error, including Poisson errors in the counting of photons from these sources, the
background subtraction, and errors associated with the model used to describe
the distribution of energy. The additional error of order $\sigma_E = 1.6$ for each datapoint
may therefore be (a) an intrinsic error of the model (as described in Example 11.1),
(b) an additional error from the correction of certain systematic errors that was
performed in the process of the analysis or (c) an additional random error that was
not already included in the original error budget. The magnitude of possible errors
in cases (b) and (c) can be estimated based on the knowledge of the collection of
the data and its analysis. If such errors cannot be as large as required to obtain
an acceptable fit, the only remaining option is to attribute this error to an intrinsic
variance of the model or to conclude that the model is not an accurate description of
the data. ◇


11.4  Estimate  of  Model  Parameters  with  Systematic  Errors 
or  Intrinsic  Scatter 


In  Sects.  11.2  and  11.3  we  have  assumed  that  intrinsic  scatter  or  additional  sources 
of systematic errors could be estimated using the best-fit values $\hat{y}_i$ obtained from
the fit without these errors. Systematic errors or intrinsic scatter, however, do have
an effect on the estimate of model parameters. The presence of systematic errors or
intrinsic scatter, as discussed earlier in this chapter, is accounted for by the addition
of another source of variance to the data according to

$$\sigma_i'^2 = \sigma_i^2 + \sigma^2. \tag{11.10}$$


The quantity $\sigma$ is either the systematic error $\sigma_E$ not accounted for in the initial estimate
of $\sigma_i$, or the intrinsic scatter $\sigma_{\rm int}$. Both cases lead to the same effect on the overall
error budget and the $\chi^2$ fit statistic to minimize becomes

$$\chi^2 = \sum_{i=1}^{N}\frac{(y_i - y(x_i))^2}{\sigma_i^2 + \sigma^2}. \tag{11.11}$$


It is clear that repeating the fitting procedure with the larger $\sigma_i'$ errors instead
of the original error will lead to new best-fit values and new uncertainties for the
model parameters. The effect of the larger errors is to de-weight datapoints that
have small values of $\sigma_i$ and in general to provide larger confidence intervals for the
model parameters. An acceptable procedure to obtain truly best-fit values of model
parameters and their confidence intervals is to first estimate the additional source of
error $\sigma$ (either an intrinsic scatter or additional statistical or systematic errors) and
then repeat the fit.
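The two-step procedure can be sketched as follows: fit once with the original errors, then add the extra error in quadrature and fit again. The weighted least-squares formulas are the standard ones for a linear model with Gaussian errors; the function names, the data, and the value of the additional error are hypothetical.

```python
import math

def linear_fit(x, y, sigma):
    """Weighted least-squares fit of y = a + b x; returns (a, b, sigma_a, sigma_b)."""
    w = [1.0 / s ** 2 for s in sigma]
    s0 = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = s0 * sxx - sx ** 2
    a = (sxx * sy - sx * sxy) / delta
    b = (s0 * sxy - sx * sy) / delta
    return a, b, math.sqrt(sxx / delta), math.sqrt(s0 / delta)

def inflate(sigma, extra):
    """Add an extra error in quadrature to each measurement error."""
    return [math.sqrt(s ** 2 + extra ** 2) for s in sigma]

# Hypothetical dataset and hypothetical additional error.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3]
sigma = [0.1, 0.2, 0.1, 0.2, 0.1]
extra = 0.3

fit_orig = linear_fit(x, y, sigma)
fit_new = linear_fit(x, y, inflate(sigma, extra))
print("original errors: a = %.3f +/- %.3f, b = %.3f +/- %.3f"
      % (fit_orig[0], fit_orig[2], fit_orig[1], fit_orig[3]))
print("inflated errors: a = %.3f +/- %.3f, b = %.3f +/- %.3f"
      % (fit_new[0], fit_new[2], fit_new[1], fit_new[3]))
```

With unequal original errors the added term changes the relative weights of the datapoints, so both the best-fit values and their uncertainties change.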

Example 11.3 The linear fit to the data of Table 6.1 for Energy 1 (independent
variable) and Energy 2 resulted in a $\chi^2_{\min} = 60.5$ for 23 degrees of freedom. The
fit was not acceptable at any level of confidence. In Example 11.1 we calculated
that an additional error of $\sigma = 1.6$ yields a $\chi^2_{\min} = 23$. We fit the data with
the addition of this error to the dependent variable and find the best-fit values of
$a = -0.085 \pm 0.48$, $b = 1.05 \pm 0.05$.
For comparison, the fit obtained with the original errors returned values of $a =
-0.26 \pm 0.088$, $b = 1.04 \pm 0.27$. These values could not be properly called “best-fit,”
since the fit was not acceptable. Yet, comparison between these values and those
for the $\chi^2_{\rm red} = 1.0$ case shows that best-fit parameters are affected by the additional
source of error and that the confidence intervals become larger with the increased
errors, as expected. ◇


Summary  of  Key  Concepts  for  this  Chapter 


□ Intrinsic scatter: An uncertainty of the model that increases the measurement
error according to $y_i = \eta_i + \epsilon_i$.

□ Debiased variance: A correction to the measured variance that accounts
for the presence of measurement errors,

$$\sigma^2_{\rm int} = \frac{1}{N-m}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 - \frac{1}{N}\sum_{i=1}^{N}\sigma_i^2.$$

The square root provides a measure of the intrinsic scatter.

□ Systematic error: A type of measurement error $\sigma_E$ that systematically shifts
the measurements (as opposed to the statistical error $\sigma_i$). The two errors
typically are added in quadrature, $\sigma_i'^2 = \sigma_i^2 + \sigma_E^2$.


Problems 

11.1  Fit  the  data  from  Table  6.1  for  the  radius  vs.  ratio  using  a  linear  model  and 
calculate  the  intrinsic  scatter  using  the  best-fit  linear  model. 

11.2 Using the same data as in Problem 11.1, provide an additional estimate of the
intrinsic scatter using the $\chi^2_{\rm red} \simeq 1$ method.

11.3 Justify the $1/(N - m)$ and $1/(N - 1)$ coefficients in (11.3) and (11.4).

11.4 Using the data for the Hubble measurements of page 157, assume that each
measurement of log v has an uncertainty of $\sigma = 0.01$. Estimate the intrinsic scatter
in the linear regression of log v vs. m.

11.5 Using the data of Problem 8.2, estimate the intrinsic scatter in the linear fit of
the X, Y data.


Chapter  12 

Fitting Two-Variable Datasets with Bivariate
Errors 


Abstract The maximum likelihood method for the fit of a two-variable dataset
described in Chap. 8 assumes that one of the variables (the independent variable
X) has negligible errors. There are many applications where this assumption is
not applicable and uncertainties in both variables must be taken into account. This
chapter expands the treatment of Chap. 8 to the fit of a two-variable dataset with
errors in both variables.


12.1 Two-Variable Datasets with Bivariate Errors

Throughout  Chaps.  8  and  10  we  have  assumed  a  simple  error  model  where  the 
independent  variable  X  is  known  without  error,  and  all  sources  of  uncertainty  in 
the fit are due to the dependent variable Y. The two-variable dataset (X, Y) was
effectively treated as a sequence of random variables of values $y_i \pm \sigma_i$ at a fixed
location $x_i$ with a parent model $y(x_i)$.

There  are  many  applications,  however,  in  which  both  variables  have  comparable 
uncertainties ($\sigma_x \simeq \sigma_y$) and there is no reason to treat one variable as independent.
In general, a two-variable dataset is described by the datapoints

$$(x_i \pm \sigma_{xi},\; y_i \pm \sigma_{yi})$$

and the covariance $\sigma^2_{xyi}$ between the two measurements. One example is the two
measurements  of  energy  in  the  data  in  Table  6.1,  where  it  would  be  appropriate  to 
account  for  errors  in  both  measurements.  There  is  in  fact  no  particular  reason  why 
one  measurement  should  be  considered  as  the  independent  variable  and  the  other 
the  dependent  variable. 

There  are  several  methods  to  deal  with  two-variable  datasets  with  bivariate  error. 
Given  the  complexity  of  the  statistical  model,  there  is  not  a  uniquely  accepted 
solution  to  the  general  problem  of  fitting  data  with  bivariate  errors.  This  chapter 
presents two methods for the linear fit to data with two-variable errors. The first
method (Sect. 12.2) applies to a linear fit and it is an extension of the least-squares
method of Sect. 8.3. The second method (Sect. 12.3) is based on an alternative
definition of $\chi^2$ and it applies to any type of fit function. Although this method
does not have an analytic solution, it can be easily implemented using numerical
methods such as Monte Carlo Markov chains described later in this book.


12.2  Generalized  Least-Squares  Linear  Fit  to  Bivariate  Data 


In  the  case  of  identical  measurement  errors  on  the  dependent  variable  Y  and  no 
error  on  the  independent  variable  X ,  the  least-squares  method  described  in  Sect.  8.3 
estimated  the  parameters  of  the  linear  model  as 


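$$\begin{aligned}
b &= \frac{{\rm Cov}(X,Y)}{{\rm Var}(X)} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \\
a &= E(Y) - b\,E(X) = \frac{1}{N}\sum_{i=1}^{N} y_i - b\,\frac{1}{N}\sum_{i=1}^{N} x_i.
\end{aligned} \tag{12.1}$$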


A generalization of this least-squares method accounts for the presence of
measurement errors in the estimate of the variances and the covariance in (12.1).
The methods of analysis presented in this section were developed by Akritas and
Bershady [2] and others [22, 24]. Those references can be used as a source of
additional information on these methods for bivariate data.

Measurements  of  the  X  and  Y  variables  can  be  described  by 


$$\begin{aligned}
x_i &= \eta_{xi} + \epsilon_{xi} \\
y_i &= \eta_{yi} + \epsilon_{yi},
\end{aligned} \tag{12.2}$$


each the sum of a parent quantity and a measurement error, as in (11.1). Accordingly,
the variances of the parent variables are given by

$$\begin{aligned}
{\rm Var}(\eta_{xi}) &= {\rm Var}(x_i) - \sigma_{xi}^2 \\
{\rm Var}(\eta_{yi}) &= {\rm Var}(y_i) - \sigma_{yi}^2.
\end{aligned} \tag{12.3}$$


This  means  that  in  (12.1)  one  must  replace  the  sample  covariance  and  variance  by  a 
debiased  or  intrinsic  covariance  and  variance,  i.e.,  quantities  that  take  into  account 
the  presence  of  measurement  errors. 

The  method  of  analysis  that  led  to  (12.1)  assumes  that  the  variable  Y  depends  on 
X.  In  other  words,  we  assumed  that  X  is  the  independent  variable.  In  this  case,  we 
talk of a fit of Y-given-X, or Y/X, and we write the linear model as

$$y = a_{Y/X} + b_{Y/X}\,x. \tag{12.4}$$

Modification  of  (12.1)  with  (12.3)  (and  an  equivalent  formula  for  the  covariance) 
leads  to  the  following  estimator  for  the  slope  and  intercept  of  the  linear  Y/X  model: 
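$$\begin{aligned}
b_{Y/X} &= \frac{{\rm Cov}(X,Y) - \sigma_{xy}^2}{{\rm Var}(X) - \sigma_x^2}
= \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) - \sum_{i=1}^{N}\sigma_{xyi}^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2 - \sum_{i=1}^{N}\sigma_{xi}^2} \\
a_{Y/X} &= \bar{y} - b_{Y/X}\,\bar{x}.
\end{aligned} \tag{12.5}$$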


In  this  equation  the  sample  variance  and  covariance  of  (12.1)  were  replaced  with 
the  corresponding  intrinsic  quantities,  and  the  subscript  Y/X  indicates  that  X  was 
considered  as  the  independent  variable. 

A  different  result  is  obtained  if  Y  is  considered  as  the  independent  variable.  In 
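that case, the X-given-Y (or X/Y) model is described as

$$x = a' + b'y. \tag{12.6}$$

The same equations above apply by exchanging the two variables X and Y:

$$\begin{aligned}
b' &= \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) - \sum_{i=1}^{N}\sigma_{xyi}^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2 - \sum_{i=1}^{N}\sigma_{yi}^2} \\
a' &= \bar{x} - b'\,\bar{y}.
\end{aligned}$$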


It  is  convenient  to  compare  the  results  of  the  Y/X  and  X/Y  fits  by  rewriting  the  latter 
in  the  usual  form  with  x  as  the  independent  variable: 


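$$y = a_{X/Y} + b_{X/Y}\,x = -\frac{a'}{b'} + \frac{x}{b'},$$

for which we find that the slope and intercept are given by

$$\begin{aligned}
b_{X/Y} &= \frac{\sum_{i=1}^{N}(y_i - \bar{y})^2 - \sum_{i=1}^{N}\sigma_{yi}^2}{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) - \sum_{i=1}^{N}\sigma_{xyi}^2} \\
a_{X/Y} &= \bar{y} - b_{X/Y}\,\bar{x}.
\end{aligned} \tag{12.7}$$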


In  general  the  two  estimators  Y/X  and  X/Y  will  give  different  results  for  the 
best-fit  line.  This  difference  highlights  the  importance  of  interpreting  the  data  to 
determine  which  variable  should  be  considered  the  independent  quantity. 
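The Y/X and X/Y estimators can be sketched in Python as follows. The function subtracts the summed measurement variances and covariances from the sample sums, as described for (12.5) and (12.7), and returns both lines in the form y = a + b x. The function name, the argument names, and the data are hypothetical; covariances default to zero when none are supplied.

```python
def debiased_fits(x, y, var_x, var_y, cov_xy=None):
    """Return ((a, b) for Y/X, (a, b) for X/Y), both written as y = a + b x."""
    n = len(x)
    if cov_xy is None:
        cov_xy = [0.0] * n   # assume independent x and y measurement errors
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Debiased sums: sample sums minus the summed measurement (co)variances.
    sxx = sum((xi - x_mean) ** 2 for xi in x) - sum(var_x)
    syy = sum((yi - y_mean) ** 2 for yi in y) - sum(var_y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) - sum(cov_xy)
    b_yx = sxy / sxx                 # X treated as the independent variable
    a_yx = y_mean - b_yx * x_mean
    b_xy = syy / sxy                 # Y treated as the independent variable
    a_xy = y_mean - b_xy * x_mean
    return (a_yx, b_yx), (a_xy, b_xy)

# Hypothetical data with equal measurement variances on both variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3]
var_x = [0.01] * 5
var_y = [0.01] * 5

(a_yx, b_yx), (a_xy, b_xy) = debiased_fits(x, y, var_x, var_y)
print(f"Y/X: a = {a_yx:.3f}, b = {b_yx:.3f}")
print(f"X/Y: a = {a_xy:.3f}, b = {b_xy:.3f}")
```

Both lines pass through the point of the sample means, so the two fits differ only in slope, which makes the comparison between them straightforward.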

Uncertainties  in  the  parameters  a  and  b  and  the  covariance  between  them  have 
been  calculated  by  Akritas  and  Bershady  [2].  For  the  Y/X  estimator  they  can  be 
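obtained via the following variables:

$$\begin{aligned}
\xi_i &= \frac{(x_i - \bar{x})(y_i - b_{Y/X}x_i - a_{Y/X}) + b_{Y/X}\sigma_{xi}^2 - \sigma_{xyi}^2}{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 - \frac{1}{N}\sum_{i=1}^{N}\sigma_{xi}^2} \\
\zeta_i &= y_i - b_{Y/X}x_i - \bar{x}\,\xi_i.
\end{aligned} \tag{12.8}$$

With these, the variances of a and b and the covariance are given by

$$\begin{aligned}
\sigma^2_{b_{Y/X}} &= \frac{1}{N^2}\sum_{i=1}^{N}(\xi_i - \bar{\xi})^2 \\
\sigma^2_{a_{Y/X}} &= \frac{1}{N^2}\sum_{i=1}^{N}(\zeta_i - \bar{\zeta})^2 \\
\sigma^2_{ab_{Y/X}} &= \frac{1}{N^2}\sum_{i=1}^{N}(\xi_i - \bar{\xi})(\zeta_i - \bar{\zeta}).
\end{aligned} \tag{12.9}$$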

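For the X/Y estimator there are equivalent formulas for the $\xi$ and $\zeta$ variables that
need to be used in place of (12.8):

$$\begin{aligned}
\xi_i &= \frac{(y_i - \bar{y})(y_i - b_{X/Y}x_i - a_{X/Y}) + b_{X/Y}\sigma_{xyi}^2 - \sigma_{yi}^2}{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) - \frac{1}{N}\sum_{i=1}^{N}\sigma_{xyi}^2} \\
\zeta_i &= y_i - b_{X/Y}x_i - \bar{x}\,\xi_i.
\end{aligned} \tag{12.10}$$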


These  values  can  then  be  used  to  calculate  variances  and  the  covariance  of  the 
parameters  as  in  the  Y/X  fit. 
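A possible implementation of the uncertainty estimate for the Y/X fit is sketched below. It builds the auxiliary variables of (12.8), written here as xi and zeta, and takes their sample variances and covariance. The 1/N^2 normalization follows the estimator of Akritas and Bershady [2] as understood here and should be checked against (12.9). The function name and the data are hypothetical.

```python
import math

def yx_fit_with_errors(x, y, var_x, cov_xy=None):
    """Y/X debiased fit; returns (a, b, sigma_a, sigma_b, cov_ab)."""
    n = len(x)
    if cov_xy is None:
        cov_xy = [0.0] * n   # assume independent x and y measurement errors
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Debiased (intrinsic) variance of X and covariance of X and Y.
    var_int = sum((xi - x_mean) ** 2 for xi in x) / n - sum(var_x) / n
    cov_int = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
               - sum(cov_xy) / n)
    b = cov_int / var_int
    a = y_mean - b * x_mean
    # Auxiliary variables, one pair per datapoint.
    xi_vals = [((xi - x_mean) * (yi - b * xi - a) + b * vx - cxy) / var_int
               for xi, yi, vx, cxy in zip(x, y, var_x, cov_xy)]
    zeta_vals = [yi - b * xi - x_mean * u
                 for xi, yi, u in zip(x, y, xi_vals)]
    xi_mean = sum(xi_vals) / n
    zeta_mean = sum(zeta_vals) / n
    # ASSUMED normalization: 1/N^2 times the sums of squared deviations.
    var_b = sum((u - xi_mean) ** 2 for u in xi_vals) / n ** 2
    var_a = sum((v - zeta_mean) ** 2 for v in zeta_vals) / n ** 2
    cov_ab = sum((u - xi_mean) * (v - zeta_mean)
                 for u, v in zip(xi_vals, zeta_vals)) / n ** 2
    return a, b, math.sqrt(var_a), math.sqrt(var_b), cov_ab

# Hypothetical data with equal measurement variances on X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3]
var_x = [0.01] * 5

a, b, sigma_a, sigma_b, cov_ab = yx_fit_with_errors(x, y, var_x)
print(f"a = {a:.3f} +/- {sigma_a:.3f}, b = {b:.3f} +/- {sigma_b:.3f}")
```

The X/Y case would follow the same pattern, with the auxiliary variables of (12.10) in place of those of (12.8).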

Example 12.1 In Fig. 12.1 we illustrate the difference in the best-fit models when X is the independent variable (12.5) or Y is the independent variable (12.7), using the data of Table 6.1. The Y/X parameters are $a_{Y/X} = -0.367$ and $b_{Y/X} = 1.118$ and the X/Y parameters are $a_{X/Y} = -0.521$ and $b_{X/Y} = 1.132$. Unfortunately there is no definitive prescription to decide which variable should be regarded as independent. In this example each variable could be equally treated as the independent variable and the difference between the two best-fit models is relatively small. The difference between the two models for a value of the x axis of 1 is approximately 20 %. Note that the linear model and the data were plotted in a logarithmic scale to provide a more compact figure.

Also, the data of Table 6.1 do not report any covariance measurement and therefore the best-fit lines were calculated assuming independence between all measurements ($\sigma_{xyi}^2 = 0$). ◊

Fig. 12.1 Linear model fits to the data of Table 6.1 using the debiased variance method. The solid line is the model that uses Energy 1 as the independent variable X (12.4), the dashed line is the model that uses Energy 2 as the independent variable Y (12.6). Note the logarithmic scale for both axes

The example based on the data of Table 6.1 shows that there is not just a single slope for the best-fit linear model, but that the results depend on which variable is assumed to be independent, as in the case of no measurement errors available (Sect. 8.5). In certain cases it may be appropriate to use a model that is intermediate between the two Y/X and X/Y results. This is called the bisector model, which consists of the linear model that bisects the two lines obtained from the Y/X and X/Y fits described above. This method is also described by Akritas and Bershady [2] and Isobe and Feigelson [22] and the best-fit bisector line can be obtained from the following formulae:

$$\begin{aligned}
b_{bis} &= \frac{b_{Y/X}\,b_{X/Y} - 1 + \sqrt{(1 + b_{Y/X}^2)(1 + b_{X/Y}^2)}}{b_{Y/X} + b_{X/Y}} \\
a_{bis} &= \bar{y} - b_{bis}\,\bar{x}.
\end{aligned} \tag{12.11}$$

The uncertainties in the slope and intercept parameters can also be obtained using this definition for the $\xi$ and $\zeta$ variables:

$$\begin{aligned}
\xi_i &= \frac{(1 + b_{X/Y}^2)\,b_{bis}}{(b_{Y/X} + b_{X/Y})\sqrt{(1 + b_{Y/X}^2)(1 + b_{X/Y}^2)}}\,\xi_{Y/X} + \frac{(1 + b_{Y/X}^2)\,b_{bis}}{(b_{Y/X} + b_{X/Y})\sqrt{(1 + b_{Y/X}^2)(1 + b_{X/Y}^2)}}\,\xi_{X/Y} \\
\zeta_i &= y_i - b_{bis}\,x_i - \bar{x}\,\xi_i,
\end{aligned} \tag{12.12}$$

where $\xi_{Y/X}$ is the $\xi$ variable defined in (12.8) for the Y/X fit and $\xi_{X/Y}$ is the $\xi$ variable defined in (12.10) for the X/Y fit.
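The bisector slope and intercept of (12.11) are simple to evaluate once the two slopes are known. The following minimal sketch uses a hypothetical function name; the sample means `x_mean` and `y_mean` are inputs the caller must supply from the data.

```python
import math


def bisector(b_yx, b_xy, x_mean, y_mean):
    """Return (a_bis, b_bis) from the Y/X and X/Y slopes, following Eq. (12.11)."""
    root = math.sqrt((1.0 + b_yx ** 2) * (1.0 + b_xy ** 2))
    b_bis = (b_yx * b_xy - 1.0 + root) / (b_yx + b_xy)
    a_bis = y_mean - b_bis * x_mean
    return a_bis, b_bis
```

When the two slopes are equal the bisector slope reduces to that common value, and otherwise it falls between them. With the slopes quoted in Example 12.2 (−0.0005 and −0.0018) the function returns a slope of about −0.0011, in agreement with the bisector slope reported there.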

Example 12.2 Figure 12.2 shows the fit to the variables Radius (X variable) and Ratio of thermal energies (Y variable) from Table 6.1. The solid line is the Y/X best-fit line with parameters a = 1.1253 and b = −0.0005, the dashed line is the X/Y best-fit line with parameters a = 1.4260 and b = −0.0018 and the dot-dash line is the bisector line with parameters a = 1.2778 and b = −0.0011. Notice how the Y/X and X/Y regressions give significantly different results. This is in part due to the presence of substantial scatter in the data, which results in several datapoints significantly distant from the best-fit regression lines. In the other example of regression with errors in both variables (Fig. 12.1) the Y/X and X/Y best-fit lines were in better agreement. ◊

Fig. 12.2 Fit to the data of Table 6.1 using errors in both variables (see Example 12.2)


12.3 Linear Fit Using Bivariate Errors in the $\chi^2$ Statistic

An alternative method to fit a dataset with errors in both variables is to re-define the $\chi^2$ statistic to account for the presence of errors in the X variable. In the case of a linear fit, the square of the deviation of each datapoint $y_i$ from the model is given by

$$(y_i - a - bx_i)^2. \tag{12.13}$$

When there is no error in the X variable, the variance of the variable in (12.13) is simply the variance of Y, $\sigma_{yi}^2$. In the presence of a variance $\sigma_{xi}^2$ for X, the variance of the linear combination $y_i - a - bx_i$ is given by

$$\mathrm{Var}(y_i - a - bx_i) = \sigma_{yi}^2 + b^2\sigma_{xi}^2,$$


where a and b are the parameters of the linear model and the variables X and Y are assumed to be independent. This suggests a new definition of the $\chi^2$ function for this dataset [35, 40], namely

$$\chi^2 = \sum_{i=1}^{N}\frac{(y_i - a - bx_i)^2}{\sigma_{yi}^2 + b^2\sigma_{xi}^2}. \tag{12.14}$$

Since each term in the denominator is the variance of the term in the numerator, the new $\chi^2$ variable defined in (12.14) is $\chi^2$-distributed with $f = N - 2$ degrees of freedom.

The complication with the minimization of this function is that the unknown parameter b appears in both the numerator and the denominator of the function that needs to be minimized. As a result, an analytic solution to the maximum likelihood method cannot be given in general. Fortunately, the problem of finding the values of a and b that minimize (12.14) can be solved numerically. This method for the linear fit of two-variable data with errors in both coordinates is therefore in common use, and it is further described in [35].
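One possible numerical approach, given here only as a sketch and not as the method of [35], exploits the fact that for a fixed slope b the intercept that minimizes (12.14) is a weighted mean. The remaining one-dimensional minimization over b is done with a golden-section search. The function names and the requirement that the caller supply a bracket `[b_lo, b_hi]` containing a single minimum are assumptions of this example.

```python
import math


def chi2_bivariate(a, b, x, y, sx2, sy2):
    """Chi-square of Eq. (12.14) for intercept a and slope b."""
    return sum((yi - a - b * xi) ** 2 / (vy + b * b * vx)
               for xi, yi, vx, vy in zip(x, y, sx2, sy2))


def best_intercept(b, x, y, sx2, sy2):
    """For fixed b, d(chi2)/da = 0 gives a weighted mean of y_i - b x_i."""
    w = [1.0 / (vy + b * b * vx) for vx, vy in zip(sx2, sy2)]
    return sum(wi * (yi - b * xi) for wi, xi, yi in zip(w, x, y)) / sum(w)


def fit_bivariate(x, y, sx2, sy2, b_lo, b_hi, tol=1e-10):
    """Golden-section search over b on [b_lo, b_hi]; returns (a, b, chi2_min)."""
    def profile(b):
        return chi2_bivariate(best_intercept(b, x, y, sx2, sy2), b, x, y, sx2, sy2)

    g = (math.sqrt(5.0) - 1.0) / 2.0
    c = b_hi - g * (b_hi - b_lo)
    d = b_lo + g * (b_hi - b_lo)
    fc, fd = profile(c), profile(d)
    while b_hi - b_lo > tol:
        if fc < fd:
            # Minimum lies in [b_lo, d]; reuse c as the new upper interior point
            b_hi, d, fd = d, c, fc
            c = b_hi - g * (b_hi - b_lo)
            fc = profile(c)
        else:
            # Minimum lies in [c, b_hi]; reuse d as the new lower interior point
            b_lo, c, fc = c, d, fd
            d = b_lo + g * (b_hi - b_lo)
            fd = profile(d)
    b = 0.5 * (b_lo + b_hi)
    a = best_intercept(b, x, y, sx2, sy2)
    return a, b, chi2_bivariate(a, b, x, y, sx2, sy2)
```

A reasonable bracket can usually be built around the slope of an ordinary least-squares fit. General-purpose minimizers would serve equally well; the golden-section search is used here only to keep the example self-contained.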
Summary of Key Concepts for this Chapter

□ Data with bivariate errors: A two-variable dataset that has errors in both variables. For these data there is no commonly accepted fit method.

□ Generalized least-squares fit to bivariate data: An extension of the traditional ML fit to two-variable data. When X is the independent variable the best-fit parameters of the linear model are

$$b_{Y/X} = \frac{\mathrm{Cov}(X,Y) - \sigma_{xy}^2}{\mathrm{Var}(X) - \sigma_{x}^2}, \qquad a_{Y/X} = \bar{y} - b_{Y/X}\,\bar{x}.$$

□ Bisector model: A best-fit model for bivariate data that bisects the Y/X and X/Y models, intended to provide an intermediate model.

□ Use of bivariate errors in $\chi^2$: The $\chi^2$ statistic can also be redefined to accommodate bivariate errors according to

$$\chi^2 = \sum_{i=1}^{N}\frac{(y_i - a - bx_i)^2}{\sigma_{yi}^2 + b^2\sigma_{xi}^2}.$$

Problems 

12.1 Use the bivariate error data of Energy 1 and Energy 2 from Table 6.1. Calculate the best-fit parameters and errors of the linear model Y/X, where X is Energy 1 and Y is Energy 2.

12.2 Use the bivariate error data of Energy 1 and Energy 2 from Table 6.1. Calculate the best-fit parameters and errors of the linear model X/Y, where X is Energy 1 and Y is Energy 2.

12.3  For  the  Energy  1  and  Energy  2  data  of  Table  6.1,  use  the  results  of 
Problems  12.1  and  12.2  to  calculate  the  bisector  model  to  the  Energy  1  vs.  Energy 
2  data. 

12.4  Repeat  Problem  12.1  for  the  Ratio  vs.  Radius  data  of  Table  6.1. 

12.5  Repeat  Problem  12.2  for  the  Ratio  vs.  Radius  data  of  Table  6.1. 

12.6  Repeat  Problem  12.3  for  the  Ratio  vs.  Radius  data  of  Table  6.1. 


Chapter  13 

Model  Comparison 


Abstract The availability of alternative models to fit a dataset requires a quantitative method for comparing the goodness of fit to different models. For Gaussian data, a lower reduced $\chi^2$ of one model with respect to another is already indicative of a better fit, but the outstanding question is whether the value is significantly lower, or whether a lower value can be just the result of statistical fluctuations. For this purpose we develop the distribution function of the F statistic, useful to compare the goodness of fit between two models and the need for an additional "nested" model component, and the Kolmogorov-Smirnov statistics, useful in providing a quantitative measure of the goodness of fit, and in comparing two datasets regardless of their fit to a specific model.


13.1  The  F  Test 

For Gaussian data, the $\chi^2$ statistic is used for determining if the fit to a given parent function y(x) is acceptable. It is possible that several different parent functions yield a goodness of fit that is acceptable. This may be the case when there are alternative models to explain the experimental data, and the data analyst is faced with the decision to determine what model best fits the experimental data. In this situation, the procedure to follow is to decide first a confidence level that is considered acceptable, say 90 or 99 %, and discard all models that do not satisfy this criterion. The remaining models are all acceptable, although a lower $\chi^2_{min}$ certainly indicates a better fit.

The first version of the F test applies to independent measurements of the $\chi^2$ fit statistic, and its application is therefore limited to cases that compare different datasets. A more common application of the F test is to compare the fit of a given dataset between two models that have a nested component, i.e., one model is a simplified version of the other. For nested model components one can determine whether the additional component is really needed to fit the data.

13.1.1 F-Test for Two Independent $\chi^2$ Measurements

Consider the case of two $\chi^2_{min}$ values obtained by fitting data from a given experiment to two different functions, $y_1(x)$ and $y_2(x)$. If both models equally well approximate the parent model, then we would expect that the two values of $\chi^2$ would be similar, after taking into consideration that they may have a different number of degrees of freedom. But if one is a better approximation to the parent model, then the value of $\chi^2$ for such model would be significantly lower than for the other. We therefore want to proceed to determine whether both $\chi^2_{min}$ statistics are consistent with the null hypothesis that the data are drawn from the respective model. The statistic to use to compare the two values of $\chi^2$ must certainly also take into account the numbers of degrees of freedom, which is related to the number of model parameters used in each determination of $\chi^2$. In fact, a larger number of model parameters may result in a lower value of $\chi^2_{min}$, simply because of the larger flexibility that the model has in following the data. For example, a dataset of N points will always be fitted perfectly by a polynomial having N terms, but this does not mean that a simpler model may not be just as good a model for the data, and the underlying experiment.

Following the theory described in Sect. 7.4, we define the F statistic as

$$F = \frac{\chi^2_{1,min}/f_1}{\chi^2_{2,min}/f_2} = \frac{\chi^2_{1,min,red}}{\chi^2_{2,min,red}}, \tag{13.1}$$

where $f_1$ and $f_2$ are the degrees of freedom of $\chi^2_{1,min}$ and $\chi^2_{2,min}$. Assuming that the two $\chi^2$ statistics are independent, then F will be distributed like the F statistic with $f_1$, $f_2$ degrees of freedom, having a mean of approximately 1 [see (7.22) and (7.24)].

There is an ambiguity in the definition of which of the two models is labeled as 1 and which as 2, since two numbers can be constructed that are the reciprocal of each other, $F_{12} = 1/F_{21}$. The usual form of the F-test is that in which the value of the statistic is F > 1, and therefore we choose the largest of $F_{12}$ and $F_{21}$ to implement a one-tailed test of the null hypothesis with significance p,

$$1 - p = \int_{F_{crit}}^{\infty} f_F(f, x)\,dx = P(F \geq F_{crit}). \tag{13.2}$$

Critical values $F_{crit}$ are reported in Tables A.8, A.9, A.10, A.11, A.12, A.13, A.14, and A.15 for various confidence levels p.
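When tables are not at hand, the probability in (13.2) can be evaluated directly. The sketch below is one way to do this with the Python standard library only: it uses the standard relation between the upper tail of the F distribution and the regularized incomplete beta function, P(F > x) = I_u(f2/2, f1/2) with u = f2/(f2 + f1 x), and evaluates the incomplete beta function with a continued fraction. All function names are illustrative and not part of the text.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_inc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_prob_to_exceed(f_value, f1, f2):
    """P(F > f_value) for an F distribution with (f1, f2) degrees of freedom."""
    return beta_inc(0.5 * f2, 0.5 * f1, f2 / (f2 + f1 * f_value))


def f_test_independent(chi2_a, dof_a, chi2_b, dof_b):
    """Return (F, probability to exceed), with the larger reduced chi2 in the numerator."""
    red_a, red_b = chi2_a / dof_a, chi2_b / dof_b
    if red_a >= red_b:
        f_value, f1, f2 = red_a / red_b, dof_a, dof_b
    else:
        f_value, f1, f2 = red_b / red_a, dof_b, dof_a
    return f_value, f_prob_to_exceed(f_value, f1, f2)
```

For example, `f_test_independent(6.19, 3, 5.05, 3)` gives F ≈ 1.23 with a probability to exceed well above 10 %, so two such fits would not be distinguished at the 90 % confidence level.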

The null hypothesis is that the two values of $\chi^2_{min}$ follow $\chi^2$ distributions; this, in turn, means that the respective fitting functions used to determine each $\chi^2_{min}$ are both good approximations of the parent distribution. Therefore the test based on this distribution can reject the hypothesis that both fitting functions are the parent distribution. If the test rejects the hypothesis at the desired confidence level, then only one of the models will still stand after the test (the one in the denominator, with the lowest reduced $\chi^2$) even if the value of $\chi^2_{min}$ alone was not able to discriminate between the two models.

Example 13.1 Consider the radius vs. ratio data of Table 6.1 (see also Problem 11.1). The linear fit to the entire dataset is not acceptable, and therefore a linear model for all measurements must be discarded. If we consider measurements 1 through 5, 6 through 10, and 11 through 15, a linear fit to these three subsets results in the values of best-fit parameters and $\chi^2_{min}$ shown in the table, along with the probability to exceed the value of the fit statistic.

Measurements   a             b                   $\chi^2_{min}$   Probability
1-5            0.97 ± 0.09   -0.0002 ± 0.0002    5.05             0.17
6-10           1.27 ± 0.22   -0.0007 ± 0.0011    6.19             0.10
11-15          0.75 ± 0.09   -0.0002 ± 0.0003    18.59            0.0

The third sample provides an unacceptable fit to the linear model, and therefore this subset cannot be further considered. For the first two samples, the fits are acceptable at the 90 % confidence level, and we can construct the F statistic as

$$F = \frac{\chi^2_{min}(6\text{-}10)}{\chi^2_{min}(1\text{-}5)} = 1.23.$$

Both $\chi^2$ have the same number of degrees of freedom (3), and Table A.13 shows that the value of 1.23 is certainly well within the 90 % confidence limit for the F statistic ($F_{crit} \simeq 5.4$). This test shows that both subsets are equally well described by a linear fit, and therefore the F-test cannot discriminate between them.

To illustrate the power of the F-test, assume that there is another set of five measurements that yield a $\chi^2_{min} = 1.0$ when fit to a linear model. This fit is clearly acceptable in terms of its $\chi^2$ probability. Constructing an F statistic between this new set and set 6-10, we would obtain

$$F = \frac{\chi^2_{min}(6\text{-}10)}{\chi^2_{min}(\text{new})} = 6.19.$$

In this case, the value of F is not consistent at the 90 % level with the F distribution with $f_1 = f_2 = 3$ degrees of freedom (the measured value exceeds the critical value). The F-test therefore results in the conclusion that, at the 90 % confidence level, the two sets are not equally likely to be drawn from a linear model, with the new set providing a better match. ◊

It is important to note that the hypothesis of independence of the two $\chi^2$ is not justified if the same data are used for both statistics. In practice, this means that the F statistic cannot be used to compare the fit of a given dataset to two different models. The test can still be used to test whether two different datasets, derived from the same experiment but with independent measurements, are equally well described by the same parametric model, as shown in the example above. In this case, the null hypothesis is that both datasets are drawn from the same parent model, and a rejection of the hypothesis means that both datasets cannot derive from the same distribution.


13.1.2 F-Test for an Additional Model Component

Consider a model y(x) with m adjustable parameters, and another model $\tilde{y}(x)$ obtained by fixing p of the m parameters to a reference (fixed) value. In this case, the $\tilde{y}(x)$ model is said to be nested into the more general model, and the task is to determine whether the additional p parameters of the general model are required to fit the data.

Example 13.2 An example of nested models are polynomial models. The general model can be taken as a polynomial of second order,

$$y(x) = a + bx + cx^2$$

and the nested model as a linear model,

$$\tilde{y}(x) = a + bx.$$

The nested model is obtained from the general model with c = 0 and has one fewer free parameter (and therefore one more degree of freedom) than the general model. ◊

Following the same discussion as in Chap. 10, we can say that

$$\begin{cases}
\chi^2_{min} \sim \chi^2(N - m) & \text{(full model)} \\
\tilde{\chi}^2_{min} \sim \chi^2(N - m + p) & \text{(``nested'' model).}
\end{cases} \tag{13.3}$$

Clearly $\chi^2_{min} \leq \tilde{\chi}^2_{min}$ because of the additional free parameters used in the determination of $\chi^2_{min}$. A lower value of $\chi^2_{min}$ does not necessarily mean that the additional parameters of the general model are required. The nested model can in fact achieve an equal or even better fit relative to the parent distribution of the fit statistic, i.e., a lower reduced $\chi^2$, because of the larger number of degrees of freedom. In general, a model with fewer parameters is to be preferred to a model with a larger number of parameters because of its more economical description of the data, provided that it gives an acceptable fit.

In Sect. 10.3 we discussed that, when comparing the true value of the fit statistic $\chi^2_{true}$ for the parent model to the minimum $\chi^2_{min}$ obtained by minimizing a set of p free parameters, $\Delta\chi^2 = \chi^2_{true} - \chi^2_{min}$ and $\chi^2_{min}$ are independent of one another, and that $\Delta\chi^2$ is distributed like $\chi^2$ with p degrees of freedom. There are situations in which the same properties apply to the two $\chi^2$ statistics described in (13.3), such that the statistic $\Delta\chi^2$ is distributed like

$$\Delta\chi^2 = \tilde{\chi}^2_{min} - \chi^2_{min} \sim \chi^2(p), \tag{13.4}$$

and it is independent of $\chi^2_{min}$. One such case of practical importance is precisely the one under consideration, i.e., when there is a nested model component described by parameters that are independent of the other model parameters. A typical example is an additional polynomial term in the fit function, as illustrated in the example above.

In this case, the null hypothesis we test is that y(x) and $\tilde{y}(x)$ are equivalent models, i.e., adding the p parameters does not constitute a significant change or improvement to the model. Under this hypothesis we can use the two independent statistics $\Delta\chi^2$ and $\chi^2_{min}$, and construct a bona fide F statistic as

$$F = \frac{\Delta\chi^2/p}{\chi^2_{min}/(N - m)}. \tag{13.5}$$

This statistic tests the null hypothesis using an F distribution with $f_1 = p$, $f_2 = N - m$ degrees of freedom. A rejection of the hypothesis indicates that the two models y(x) and $\tilde{y}(x)$ are not equivalent. In practice, a rejection constitutes a positive result, indicating that the additional model parameters in the nested component are actually needed to fit the data. A common situation is when there is a single additional model parameter, p = 1, and the corresponding critical values of F are reported in Table A.8. A discussion of certain practical cases in which additional model components may obey (13.4) is provided in a research article by Protassov [36].

Example 13.3 The data of Table 10.1 and Fig. 10.2 are well fit by a linear model, while a constant model appears not to be a good fit to all measurements. Using only the middle three measurements, we want to compare the goodness of fit to a linear model, and that to a constant model, and determine whether the addition of the b parameter provides a significant improvement to the fit.

The best-fit linear model has a $\chi^2_{min} = 0.13$ which, for $f_2 = N - m = 1$ degree of freedom, has a probability to exceed this value of 72 %, i.e., it is an excellent fit. A constant model has a $\tilde{\chi}^2_{min} = 7.9$, which, for 2 degrees of freedom, has a probability to exceed this value of >0.01, i.e., it is acceptable at the 99 % confidence level, but not at the 90 % level. If the analyst requires a level of confidence ≤ 90 %, then the constant model should be discarded, and no further analysis of the experiment is needed. If the analyst can accept a 99 % confidence level, we can determine whether the improvement in $\chi^2$ between the constant and the linear model is significant. We construct the statistic

$$F = \frac{(\tilde{\chi}^2_{min} - \chi^2_{min})/1}{\chi^2_{min}/1} = 59.4$$

which, according to Table A.8 for $f_1 = 1$ and $f_2 = 1$, is not significant at the 99 % (nor at the 95 %) confidence level, although it is at the 90 % level. In fact, the critical value of the F distribution with $f_1 = 1$, $f_2 = 1$ at the 99 % confidence level is $F_{crit} = 4{,}052$. Therefore a data analyst willing to accept a 99 % confidence level should conclude that the additional model component b is not required, since there is >1 % (actually, >5 %) probability that such an improvement in the $\chi^2$ statistic is due to chance, and not to the fact that the general model is truly a more accurate description of the data. ◊
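The arithmetic of the nested-model test can be sketched in a few lines. The function names below are illustrative. The second function applies only to the special case $f_1 = f_2 = 1$, where the F variable is the square of a Cauchy-distributed variable (a Student t with one degree of freedom), so that its probability to exceed has a closed form; this identity is brought in here as an aid and is not part of the text.

```python
import math


def nested_f_statistic(chi2_nested, chi2_full, p, dof_full):
    """F statistic of Eq. (13.5): (delta chi2 / p) / (chi2_full / dof_full)."""
    delta_chi2 = chi2_nested - chi2_full
    return (delta_chi2 / p) / (chi2_full / dof_full)


def f11_prob_to_exceed(f_value):
    """P(F > f_value) for f1 = f2 = 1 only, using F = T^2 with T Cauchy-distributed."""
    return 1.0 - (2.0 / math.pi) * math.atan(math.sqrt(f_value))
```

With the rounded values quoted in Example 13.3 (7.9 and 0.13), `nested_f_statistic(7.9, 0.13, 1, 1)` returns about 59.8, close to the 59.4 quoted in the example, which presumably comes from unrounded inputs. The corresponding probability to exceed is about 8 %, i.e., between 5 and 10 %.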

The example above illustrates the principle of simplicity or parsimony in the analysis of data. When choosing between two models, both with an acceptable fit statistic at the same confidence level (in the previous example at the 99 % level), one should prefer the model with fewer parameters, even if its fit statistic (e.g., the reduced $\chi^2_{min}$) is inferior to that of the more complex model. This general guiding principle is sometimes referred to as Occam's razor, after the Middle Ages philosopher and Franciscan friar William of Occam.


13.2  Kolmogorov-Smirnov  Tests 

Kolmogorov-Smirnov tests are a different method for the comparison of a one-dimensional dataset to a model, or for the comparison of two datasets to one another. The tests make use of the cumulative distribution function, and are applicable to measurements of a single variable X, for example to determine if it is distributed like a Gaussian. For two-variable datasets, the $\chi^2$ and F tests remain the most viable option.

The greatest advantage of the Kolmogorov-Smirnov test is that it does not require the data to be binned, and, for the case of the comparison between two datasets, it does not require any parameterization of the data. These advantages come at the expense of a more complicated mathematical treatment to find the distribution function of the test statistic. Fortunately, numerical tables and analytical approximations make these tests manageable.


13.2.1  Comparison  of  Data  to  a  Model 

Consider a random variable X with cumulative distribution function F(x). The data consist of N measurements, and for simplicity we assume that they are in increasing order, $x_1 \leq x_2 \leq \ldots \leq x_N$. This condition can be achieved by re-labelling the measurements, which preserves the statistical properties of the data. The goal is to construct a statistic that describes the difference between the sample distribution of the data and a specified distribution, to test whether the data are compatible with this distribution.

Start with the sample cumulative distribution

$$F_N(x) = \frac{1}{N}\,[\#\text{ of measurements} \leq x]. \tag{13.6}$$

By definition, $0 \leq F_N(x) \leq 1$. The test statistic we want to use is defined as

$$D_N = \max_x |F_N(x) - F(x)|, \tag{13.7}$$

where F(x) is the parent distribution, and the maximum value of the difference between the parent distribution and the sample distribution is calculated for all values in the support of X.
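Because the sample cumulative distribution is a step function, the maximum in (13.7) is attained at one of the datapoints, either just before or just after a step. A minimal sketch of the calculation follows; the function names are illustrative, and the Gaussian cumulative distribution is included only as an example of a fully specified parent distribution.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_N for a fully specified cdf."""
    xs = sorted(data)
    n = len(xs)
    d_n = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_N jumps from (i-1)/N to i/N at x; check the deviation on both sides
        d_n = max(d_n, i / n - f, f - (i - 1) / n)
    return d_n
```

A call such as `ks_statistic(measurements, lambda v: gaussian_cdf(v, 5.7, 1.0))` would compare a list of measurements with a Gaussian of mean 5.7 and unit variance.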

One of the remarkable properties of the statistic $D_N$ is that it has the same distribution for any underlying distribution of X, provided X is a continuous variable. The proof that $D_N$ has the same distribution regardless of the distribution of X illustrates the properties of the cumulative distribution and of the quantile function presented in Sect. 4.8.

Proof We assume that F(x) is continuous and strictly increasing. This is certainly the case for a Gaussian distribution, or any other distribution that does not have intervals where the probability density function is f(x) = 0. We make the change of variables y = F(x), so that the measurement $x_k$ corresponds to $y_k = F(x_k)$. This change of variables is such that

$$F_N(x) = \frac{(\#\text{ of } x_i \leq x)}{N} = \frac{(\#\text{ of } y_k \leq y)}{N} = U_N(y),$$

where $U_N(y)$ is the sample cumulative distribution of Y and $0 \leq y \leq 1$. The cumulative distribution of Y is

$$U(y) = P(Y \leq y) = P(X \leq x) = F(x) = y.$$

The fact that the cumulative distribution is U(y) = y shows that Y is a uniform distribution between 0 and 1. As a result, the statistic $D_N$ is equivalent to

$$D_N = \max_{0 \leq y \leq 1} |U_N(y) - U(y)|,$$

where Y is a uniform distribution. Since this is true no matter the original distribution X, $D_N$ has the same distribution for any X. Note that this derivation relies on the continuity of X, and this assumption must be verified to apply the resulting Kolmogorov-Smirnov test. □

The distribution function of the statistic $D_N$ was determined by Kolmogorov in 1933 [25], and it is not easy to evaluate analytically. In the limit of large N, the cumulative distribution of $D_N$ is given by

$$\lim_{N \to \infty} P(D_N < z/\sqrt{N}) = \sum_{r=-\infty}^{\infty} (-1)^r e^{-2r^2z^2} = \Phi(z). \tag{13.8}$$

Table 13.1 Critical points of the Kolmogorov distribution $D_N$ for large values of N

Confidence level p    $\sqrt{N} D_N$
0.50                  0.828
0.60                  0.895
0.70                  0.973
0.80                  1.073
0.90                  1.224
0.95                  1.358
0.99                  1.628

The function $\Phi(z)$ can also be used to approximate the probability distribution of $D_N$ for small values of N, using

$$P\left(D_N < z/(\sqrt{N} + 0.12 + 0.11/\sqrt{N})\right) \simeq \Phi(z). \tag{13.9}$$

A useful numerical approximation for $P(D_N < z)$ is also provided in [30].
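The series in (13.8) converges quickly for the values of z of practical interest, so $\Phi(z)$ is easy to evaluate by direct summation. The sketch below uses the fact that the terms with r and −r are equal, so that $\Phi(z) = 1 - 2\sum_{r \geq 1} (-1)^{r-1} e^{-2r^2z^2}$. The function names and the truncation at 100 terms are choices made for this example; the truncation is adequate for z larger than about 0.1 but not for much smaller values.

```python
import math


def kolmogorov_cdf(z, terms=100):
    """Phi(z) of Eq. (13.8), summed over r = 1..terms using the symmetry in r."""
    if z <= 0.0:
        return 0.0
    partial = 0.0
    for r in range(1, terms + 1):
        partial += (-1) ** (r - 1) * math.exp(-2.0 * r * r * z * z)
    return 1.0 - 2.0 * partial


def ks_prob_below(d_n, n):
    """Approximate P(D_N < d_n) for a sample of size n, following Eq. (13.9)."""
    z = d_n * (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n))
    return kolmogorov_cdf(z)
```

Evaluating `kolmogorov_cdf` at the entries of Table 13.1 reproduces the tabulated confidence levels to within rounding, for example at z = 1.358 for p = 0.95.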

The probability distribution of $D_N$ can be used to test whether a sample distribution is consistent with a model distribution. Critical values of the $D_N$ distribution with probability p,

$$P(D_N \leq T_{crit}) = p, \tag{13.10}$$

are shown in Table 13.1 in the limit of large N. For small N, critical values of the $D_N$ statistic are provided in Table A.25. If the measured value for $D_N$ is greater than the critical value, then the null hypothesis must be rejected, and the data are not consistent with the model. The test allows no free parameters, i.e., the distribution that represents the null hypothesis must be fully specified.

Example 13.4 Consider the data from Thomson's experiment to measure the ratio m/e of an electron (page 23). We can use the $D_N$ statistic to test whether either of the two sets of measurements of the variable m/e is consistent with a given hypothesis. It is necessary to realize that the Kolmogorov-Smirnov test applies to a fully specified hypothesis $H_0$, i.e., the parent distribution F(x) cannot have free parameters that are to be determined by a fit to the data. We use a fiducial hypothesis that the ratio is described by a Gaussian distribution of mean $\mu = 5.7$ (the true value in units of $10^7$ g Coulomb$^{-1}$, though the units are unnecessary for this test), and a variance of $\sigma^2 = 1$. Both measurements are inconsistent with this model, as can be seen from Fig. 13.1. See Problem 13.1 for a quantitative analysis of the results. ◊

Fig. 13.1 Kolmogorov-Smirnov test applied to the measurements of the ratio m/e from Thomson's experiments described on page 23. The black line corresponds to the measurements for Tube 1, and the red line to those of Tube 2 (measurements have been multiplied by $10^7$). The dot-dashed line is the cumulative distribution of a Gaussian with $\mu = 5.7$ (the correct value) and a fiducial variance of $\sigma^2 = 1$


13.2.2  Two-Sample  Kolmogorov-Smirnov  Test 


A similar statistic can be defined to compare two datasets:

$$D_{NM} = \max_x |F_M(x) - G_N(x)|, \tag{13.11}$$

where $F_M(x)$ is the sample cumulative distribution of a set of M observations, and $G_N(x)$ that of another independent set of N observations; in this case, there is no parent model used in the testing. The statistic $D_{NM}$ measures the maximum deviation between the two cumulative distributions, and by nature it is a discrete distribution. In this case, we can show that the distribution of the statistic is the same as in (13.9), provided that the change

$$N \to \frac{MN}{M + N}$$

is made. This number can be considered as the effective number of datapoints of the two distributions. For the two-sample Kolmogorov-Smirnov $D_{NM}$ test we can therefore use the same table as in the Kolmogorov-Smirnov one-sample test, provided N is substituted with MN/(M + N) and that N and M are both large.

As N and M become large, the statistic approaches the following distribution:

$$\lim_{N,M \to \infty} P\left(D_{NM} < z\Big/\sqrt{\frac{MN}{M + N}}\right) = \Phi(z). \tag{13.12}$$
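A sketch of the two-sample statistic follows. Since both sample cumulative distributions are step functions that change only at datapoints, it is enough to evaluate their difference at every point of the pooled sample. The function name is illustrative; the second returned value is the effective number of datapoints MN/(M + N) that replaces N when the one-sample tables are used.

```python
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Return (D_NM, n_eff) for two independent samples of sizes M and N."""
    xa, xb = sorted(sample_a), sorted(sample_b)
    m, n = len(xa), len(xb)
    d_nm = 0.0
    for x in xa + xb:
        f_m = bisect_right(xa, x) / m   # fraction of the first sample <= x
        g_n = bisect_right(xb, x) / n   # fraction of the second sample <= x
        d_nm = max(d_nm, abs(f_m - g_n))
    n_eff = m * n / (m + n)
    return d_nm, n_eff
```

For large samples, the product of the square root of `n_eff` and `d_nm` can then be compared with the critical points of Table 13.1.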


Proof We have already shown that for a sample distribution with M points,

$$F_M(x) - F(x) = U_M(y) - U(y),$$

where U is a uniform distribution in (0,1). Since

$$F_M(x) - G_N(x) = F_M(x) - F(x) - (G_N(x) - G(x)),$$

where F = G is the parent distribution, it follows that $F_M(x) - G_N(x) = U_M - V_N$, where $U_M$ and $V_N$ are the sample distributions of two uniform variables. Therefore the statistic

$$D_{NM} = \max_x |F_M(x) - G_N(x)|$$

is independent of the parent distribution, same as for the statistic $D_N$.

Next we show how the factor $\sqrt{1/N + 1/M}$ originates. It is clear that the expectation of $F_M(x) - G_N(x)$ is zero, at least in the limit of large $N$ and $M$; the second moment can be calculated as

$$E[(F_M(x) - G_N(x))^2] = E[(F_M(x) - F(x))^2] + E[(G_N(x) - G(x))^2] - 2E[(F_M(x) - F(x))(G_N(x) - G(x))]$$
$$= E[(F_M(x) - F(x))^2] + E[(G_N(x) - G(x))^2].$$

In fact, since $F_M(x) - F(x)$ is independent of $G_N(x) - G(x)$, their covariance is zero. Each of the two remaining terms can be evaluated using the following calculation:


$$E[(F_M(x) - F(x))^2] = E\left[\frac{1}{M^2}\left(\{\#\text{ of } x_i\text{'s} < x\} - M F(x)\right)^2\right] = \frac{1}{M^2}\,E\left[\left(\{\#\text{ of } x_i\text{'s} < x\} - E[\{\#\text{ of } x_i\text{'s} < x\}]\right)^2\right].$$

For a fixed value of $x$, the variable $\{\#\text{ of } x_i\text{'s} < x\}$ follows a binomial distribution in which “success” is represented by one measurement being $< x$, and the probability of success is $p = F(x)$. The expectation in the equation above is therefore equivalent to the variance of a binomial distribution with $M$ tries, for which $\sigma^2 = Mp(1 - p)$, leading to

$$E[(F_M(x) - F(x))^2] = \frac{1}{M} F(x)(1 - F(x)).$$

It follows that

$$E[(F_M(x) - G_N(x))^2] = \left(\frac{1}{M} + \frac{1}{N}\right) F(x)(1 - F(x)).$$

A simple way to make the mean square of $F_M(x) - G_N(x)$ independent of $N$ and $M$ is to divide the difference by $\sqrt{1/M + 1/N}$. This requirement is therefore a necessary condition for the variable $\sqrt{NM/(N + M)}\,D_{NM}$ to be independent of $N$ and $M$.

Finally, we show that $\sqrt{NM/(N + M)}\,D_{NM}$ is distributed in the same way as $\sqrt{N} D_N$, at least in the asymptotic limit of large $N$ and $M$. Using the results from the $D_N$ distribution derived in the previous section, we start with


$$\max_x \left|\sqrt{\frac{MN}{M+N}}\,(F_M(x) - G_N(x))\right| = \max_{0 \le y \le 1} \left|\sqrt{\frac{MN}{M+N}}\,(U_M - V_N)\right|.$$

The variable can be rewritten as

$$\sqrt{\frac{MN}{M+N}}\,\left(U_M - U + (V - V_N)\right) = \sqrt{\frac{N}{M+N}}\left(\sqrt{M}\,(U_M - U)\right) + \sqrt{\frac{M}{M+N}}\left(\sqrt{N}\,(V - V_N)\right).$$

Using the central limit theorem, it can be shown that the two variables $\alpha = \sqrt{M}(U_M - U)$ and $\beta = \sqrt{N}(V - V_N)$ have the same distribution, which tends to a Gaussian in the limit of large $M$ and $N$. We then write

$$\sqrt{\frac{MN}{M+N}}\,(F_M(x) - G_N(x)) = \sqrt{\frac{N}{M+N}}\,\alpha + \sqrt{\frac{M}{M+N}}\,\beta$$

and use the property that, for two independent and identically distributed Gaussian variables $\alpha$ and $\beta$ of zero mean, the variable $a \cdot \alpha + b \cdot \beta$ is distributed like $\alpha$, provided that $a^2 + b^2 = 1$. We therefore conclude that, in the asymptotic limit,

$$\sqrt{\frac{MN}{M+N}}\,D_{NM} = \max_x \left|\sqrt{\frac{MN}{M+N}}\,(F_M(x) - G_N(x))\right| \sim \max_y \left|\sqrt{N}\,(V_N - V)\right|.$$

□

Example 13.5 We can use the two-sample Kolmogorov-Smirnov statistic to compare the data from Tube #1 and Tube #2 of Thomson’s experiment to measure the ratio $m/e$ of an electron (page 23). The result, shown in Fig. 13.1, indicates that the two measurements are not in agreement with one another. See Problem 13.2 for a quantitative analysis of this test. ◇


Summary  of  Key  Concepts  for  this  Chapter 

□ F Test: A test to compare two independent $\chi^2$ measurements,

$$F = \frac{\chi^2_{1,\mathrm{red}}}{\chi^2_{2,\mathrm{red}}}.$$

□ F Test for additional component: The significance of an additional model component with $p$ parameters can be tested using

$$F = \frac{\Delta\chi^2/p}{\chi^2_{\min}/(N - m)}$$

when the additional component is nested within the general model.

□  Kolmogorov-Smirnov  test :  A  non-parametric  test  to  compare  a  one- 
variable  dataset  to  a  model  or  two  datasets  with  one  another. 


Problems 

13.1 Using the data from Thomson’s experiment on page 23, determine the values of the Kolmogorov-Smirnov statistic $D_N$ for the measurement of Tube #1 and Tube #2, when compared with a Gaussian model for the measurement with $\mu = 5.7$ and $\sigma^2 = 1$. Determine at what confidence level you can reject the hypothesis that the two measurements are consistent with the model.

13.2 Using the data from Thomson’s experiment on page 23, determine the values of the two-sample Kolmogorov-Smirnov statistic $D_{NM}$ for comparison between the two measurements. Determine at what confidence level you can reject the hypothesis that the two measurements are consistent with one another.

13.3  Using  the  data  of  Table  10.1,  determine  whether  the  hypothesis  that  the  last 
three  measurements  are  described  by  a  simple  constant  model  can  be  rejected  at  the 
99  %  confidence  level. 

13.4 A given dataset with $N = 5$ points is fit to a linear model, for a fit statistic of $\chi^2_{\min}$. When adding an additional nested parameter to the fit, $p = 1$, determine by how much $\chi^2_{\min}$ should be reduced for the additional parameter to be significant at the 90 % confidence level.


13.5 A dataset is fit to model 1, with minimum $\chi^2$ fit statistic of $\chi^2_1 = 10$ for 5 degrees of freedom; the same dataset is also fit to another model, with $\chi^2_2 = 5$ for 4 degrees of freedom. Determine which model is acceptable at the 90 % confidence level, and whether the F test can be used to choose one of the two models.

13.6 A dataset of size $N$ is successfully fit with a model, to give a fit statistic $\chi^2_{\min}$. A model with a nested component with 1 additional independent parameter, for a total of $m$ parameters, is then fit to the data, providing a reduction in the fit statistic of $\Delta\chi^2$. Determine what is the minimum $\Delta\chi^2$ that, in the limit of a large number of degrees of freedom, provides 90 % confidence that the additional parameter is significant.


Chapter  14 

Monte  Carlo  Methods 


Abstract  The  term  Monte  Carlo  refers  to  the  use  of  random  variables  to  evaluate 
quantities  such  as  integrals  or  parameters  of  fit  functions  that  are  typically  too 
complex  to  evaluate  via  other  analytic  methods.  This  chapter  presents  elementary 
Monte  Carlo  methods  that  are  of  common  use  in  data  analysis  and  statistics, 
in  particular  the  bootstrap  and  jackknife  methods  to  estimate  parameters  of  fit 
functions. 


14.1  What  is  a  Monte  Carlo  Analysis? 

The  term  Monte  Carlo  derives  from  the  name  of  a  locality  in  the  Principality  of 
Monaco  known  for  its  resorts  and  casinos.  In  statistics  and  data  analysis  Monte 
Carlo  is  an  umbrella  word  that  means  the  use  of  computer-aided  numerical  methods 
to  solve  a  specific  problem,  typically  with  the  aid  of  random  numbers. 

Traditional Monte Carlo methods include numerical integration of functions that can be graphed but that don’t have a simple analytic solution, and simulation of random variables using random samples from a uniform distribution. Another problem that benefits from the use of random numbers is the estimation of uncertainties in the best-fit parameters of analytical models used to fit data. There are cases when an analytical solution for the error in the parameters is not available. In many of those cases, the bootstrap or the jackknife methods can be used to obtain reliable estimates for those uncertainties.

Among many other applications, Monte Carlo Markov chains stand out as a class of Monte Carlo methods that is now commonplace across many fields of research. The theory of Markov chains (Chap. 15) dates to the early twentieth century, yet only over the past 20 years or so has it found widespread use as Monte Carlo Markov chains (Chap. 16), because of the computational power necessary to implement the method.


14.2 Traditional Monte Carlo Integration


A common numerical task is the evaluation of the integral of a function $f(x)$ for which an analytic solution is either unavailable or too complicated to calculate exactly,

$$I = \int_A f(x)\,dx. \tag{14.1}$$


We want to derive a method to approximate this integral by randomly drawing $N$ samples from the support $A$. For simplicity, we assume that the domain of the function $f(x)$ is a subset of real numbers between $a$ and $b$. We start by drawing samples from a uniform distribution between these two values,

$$g(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{otherwise.} \end{cases} \tag{14.2}$$


Recall that for a random variable $X$ with continuous distribution $g(x)$, the expectation (or mean value) is defined as

$$E[X] = \int_{-\infty}^{\infty} x\,g(x)\,dx$$

(see (2.6)); we have also shown that the mean can be approximated as

$$E[X] \simeq \frac{1}{N}\sum_{i=1}^{N} x_i \tag{14.3}$$

where $x_i$ are independent measurements of that variable. The expectation of the function $f(x)$ of a random variable is

$$E[f(x)] = \int_{-\infty}^{\infty} f(x)\,g(x)\,dx,$$

and it can be estimated using the Law of Large Numbers (Sect. 4.5):

$$E[f(x)] \simeq \frac{1}{N}\sum_{i=1}^{N} f(x_i). \tag{14.4}$$


These equations can be used to approximate the integral in (14.1) as a simple sum:

$$I = (b-a)\int_a^b f(x)\,g(x)\,dx = (b-a)\,E[f(x)] \simeq (b-a)\,\frac{1}{N}\sum_{i=1}^{N} f(x_i). \tag{14.5}$$

Equation (14.5) can be implemented by drawing $N$ random uniform samples $x_i$ from the support, then calculating $f(x_i)$, and evaluating the sum. This is the basic Monte Carlo integration method, and it can be easily implemented by using a random number generator available in most programming languages.
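As a minimal sketch of (14.5), the following Python code integrates $f(x) = \sin x$ between 0 and $\pi$, for which the exact result is 2. The function name `mc_integrate`, the choice of integrand, and the fixed seed are assumptions made only for illustration.

```python
import math
import random


def mc_integrate(f, a, b, n_samples, rng):
    """Approximate the integral of f over [a, b] as (b - a) * mean of f(x_i)."""
    total = 0.0
    for _ in range(n_samples):
        x_i = rng.uniform(a, b)  # uniform sample from the support
        total += f(x_i)
    return (b - a) * total / n_samples


rng = random.Random(1)  # fixed seed, for reproducibility only
estimate = mc_integrate(math.sin, 0.0, math.pi, 100_000, rng)
print(f"Monte Carlo estimate: {estimate:.4f} (exact value: 2)")
```

The estimate fluctuates from run to run when the seed is changed; the size of those fluctuations decreases as more samples are drawn.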


The method can be generalized to more than one dimension; if the support $A \subset \mathbb{R}^n$ has volume $V$, then the integration of an $n$-dimensional function $f(x)$ is given by the following sum:

$$I = \frac{V}{N}\sum_{i=1}^{N} f(x_i). \tag{14.6}$$


It is clear that the precision in the evaluation of the integral depends on the number of samples drawn. The error made by this method of integration can be estimated using the following interpretation of (14.6): the quantity $V f(x)$ is the random variable of interest, and $I$ is the expected value. Therefore, the variance of the random variable is given by the usual expression,

$$\sigma_I^2 = \frac{V^2}{N}\sum_{i=1}^{N} \left(f(x_i) - \bar{f}\right)^2, \tag{14.7}$$

where $\bar{f}$ is the sample mean of the $f(x_i)$. Since $I$ is estimated as the average of $N$ such samples, its standard error is $\sigma_I/\sqrt{N}$. This means that the relative error in the calculation of the integral is

$$\frac{\sigma_I/\sqrt{N}}{I} = \frac{1}{\sqrt{N}}\,\frac{\sqrt{N\sum_{i=1}^{N}\left(f(x_i) - \bar{f}\right)^2}}{\sum_{i=1}^{N} f(x_i)} \propto \frac{1}{\sqrt{N}}; \tag{14.8}$$

as expected, the relative error decreases like the inverse square root of $N$, same as for a Poisson variable. Equation (14.8) can be used to determine how many samples are needed to estimate an integral with a given precision.
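The multi-dimensional sum and its error can be sketched in Python as follows. The example assumes a rectangular support, so that the volume $V$ is the product of the side lengths, and it reports the standard error of $I$ as the spread of $V f(x_i)$ divided by $\sqrt{N}$. The function `mc_integrate_box` and the test integrand $x^2 + y^2$ are hypothetical choices for illustration.

```python
import math
import random


def mc_integrate_box(f, bounds, n_samples, rng):
    """Integrate f over a box; return the estimate and its standard error."""
    volume = 1.0
    for low, high in bounds:
        volume *= high - low
    values = []
    for _ in range(n_samples):
        point = [rng.uniform(low, high) for low, high in bounds]
        values.append(f(point))
    f_mean = sum(values) / n_samples
    f_var = sum((v - f_mean) ** 2 for v in values) / n_samples
    integral = volume * f_mean                        # sum of (14.6)
    std_error = volume * math.sqrt(f_var / n_samples)  # shrinks like 1/sqrt(N)
    return integral, std_error


rng = random.Random(2)  # fixed seed, for reproducibility only
# Integral of x^2 + y^2 over 0 <= x <= 2, 0 <= y <= 1; exact value is 10/3.
integral, std_error = mc_integrate_box(
    lambda p: p[0] ** 2 + p[1] ** 2, [(0.0, 2.0), (0.0, 1.0)], 50_000, rng)
print(f"I = {integral:.4f} +/- {std_error:.4f}")
print(f"relative error = {std_error / integral:.4%}")
```

Quadrupling the number of samples should approximately halve the reported standard error, which gives a practical way to choose $N$ for a target precision.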


14.3  Dart  Monte  Carlo  Integration  and  Function  Evaluation 

Another method to integrate a function, or to perform related mathematical operations, can be shown by way of an example. Assume that we want to measure the area of a circle of radius $R$. One can draw a random sample of $N$ values in the $(x, y)$ plane, as shown in Fig. 14.1, and count all the points that fall within the circle, $N(R)$. The area of the circle, or of any other figure with known analytic function, is accordingly estimated as

$$A = \frac{N(R)}{N} \times V \tag{14.9}$$

in which $V$ is the volume sampled by the two random variables. In the case of a circle of radius $R = 1$ we have $V = 4$, and since the known area is $A = \pi R^2$, this method provides an approximation to the number $\pi$.

Fig. 14.1 Monte Carlo method to perform a calculation of the area of a circle (also a simulation of the number $\pi$), with $N = 1000$ iterations


Notice that (14.9) is equivalent to (14.6), in which the sum $\sum f(x_i)$ becomes $N(R)$, where $f(x_i) = 1$ indicates that a given random data point $x_i$ falls within the boundaries of the figure of interest.

Example 14.1 (Simulation of the Number $\pi$) Figure 14.1 shows a Monte Carlo simulation of the number $\pi$, using 1000 random numbers drawn in a box of linear size 2, encompassing a circle of radius $R = 1$. The simulation has a number $N(R) = 772$ of points within the unit circle, resulting in an estimate of the area of the circle as $\pi R^2 = 0.772 \times 4 = 3.088$. Compared with the exact result of $\pi = 3.14159$, the simulation has an error of 1.7 %. According to (14.8), a 1000 iteration simulation has an expected relative error of order 3.1 %, therefore the specific simulation reported in Fig. 14.1 is consistent with the expected error, and more numbers must be drawn to improve the precision. ◇
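A simulation of this kind can be sketched in a few lines of Python. The function name `estimate_pi` and the seed are illustrative assumptions; the number of points inside the circle will differ from the value quoted in the example, since it depends on the random numbers drawn.

```python
import math
import random


def estimate_pi(n_points, rng):
    """Dart method: fraction of points inside the unit circle, times V = 4."""
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)  # box of linear size 2
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:    # point falls within the circle
            inside += 1
    volume = 4.0                    # area sampled by the two variables
    return inside / n_points * volume, inside


rng = random.Random(3)  # fixed seed, for reproducibility only
pi_estimate, n_inside = estimate_pi(100_000, rng)
relative_error = abs(pi_estimate - math.pi) / math.pi
print(f"N(R) = {n_inside}, estimate = {pi_estimate:.4f}")
print(f"relative error = {relative_error:.3%}")
```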


14.4  Simulation  of  Random  Variables 

A  method  for  the  simulation  of  a  random  variable  was  discussed  in  Sect.  4.8.  Since 
the  generation  of  random  samples  from  a  uniform  random  variable  was  involved, 
this  method  also  falls  under  the  category  of  Monte  Carlo  simulations. 


The method is based on (4.42):

$$X = F^{-1}(U),$$

in which $F^{-1}$ represents the inverse of the cumulative distribution of the target variable $X$, and $U$ represents a uniform random variable between 0 and 1. In Sect. 4.8 we provided an example of how to use (4.42) to simulate an exponential distribution, which has a simple analytic function for its cumulative distribution.
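For the exponential case, the cumulative distribution $F(x) = 1 - e^{-\lambda x}$ can be inverted to give $X = -\ln(1 - U)/\lambda$. The sketch below applies this inversion in Python; the function name `sample_exponential` and the rate $\lambda = 2$ are assumptions made for illustration.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one exponential variable via X = F^{-1}(U) = -ln(1 - U) / rate."""
    u = rng.random()  # uniform in [0, 1), so 1 - u is never zero
    return -math.log(1.0 - u) / rate


rng = random.Random(4)  # fixed seed, for reproducibility only
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(f"sample mean = {sample_mean:.4f}, expected 1/rate = {1.0 / rate:.4f}")
```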

The Gaussian distribution is perhaps the most common variable in many statistical applications, and its generation cannot be accomplished by (4.42), since the cumulative distribution is a special function and $F(x)$ does not have a closed form. A method to overcome this limitation was discussed in Sect. 4.8.2, and it consists of using two uniform random variables $U$ and $V$ to simulate two standard Gaussians $X$ and $Y$ of zero mean and unit variance via (4.45),

$$\begin{cases} X = \sqrt{-2\ln(1 - U)} \cdot \cos(2\pi V) \\ Y = \sqrt{-2\ln(1 - U)} \cdot \sin(2\pi V). \end{cases}$$

A Gaussian $X'$ of mean $\mu$ and variance $\sigma^2$ is related to the standard Gaussian $X$ by the transformation

$$X = \frac{X' - \mu}{\sigma} \tag{14.10}$$

and therefore it can be simulated via

$$X' = \left(\sqrt{-2\ln(1 - U)} \cdot \cos(2\pi V)\right)\sigma + \mu. \tag{14.11}$$

Figure 14.2 shows a simulation of a Gaussian distribution function using (14.11). Precision improves as the number of samples increases.
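The transformation can be written as a short Python sketch. The function name `gaussian_pair` is hypothetical, and the parameters $\mu = 1$ and $\sigma^2 = 2$ are chosen to match those quoted for Fig. 14.2; the sample moments printed at the end depend on the seed.

```python
import math
import random


def gaussian_pair(mu, sigma, rng):
    """Return two independent Gaussians of mean mu and standard deviation sigma."""
    u = rng.random()  # uniform in [0, 1), so 1 - u is never zero
    v = rng.random()
    radius = math.sqrt(-2.0 * math.log(1.0 - u))
    x = radius * math.cos(2.0 * math.pi * v)  # standard Gaussian X
    y = radius * math.sin(2.0 * math.pi * v)  # standard Gaussian Y
    return sigma * x + mu, sigma * y + mu     # rescale and shift


rng = random.Random(5)  # fixed seed, for reproducibility only
mu, sigma = 1.0, math.sqrt(2.0)
samples = []
for _ in range(25_000):
    samples.extend(gaussian_pair(mu, sigma, rng))
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / (len(samples) - 1)
print(f"sample mean = {sample_mean:.3f}, sample variance = {sample_var:.3f}")
```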


Fig. 14.2 Monte Carlo simulation of the probability distribution function of a Gaussian of $\mu = 1$ and $\sigma^2 = 2$ using 1000 samples according to (14.11). Horizontal axis: random variable $X$


14.5 Monte Carlo Estimates of Errors for Two-Variable Datasets

The two methods presented in this section, the bootstrap and the jackknife, are among the most common techniques to estimate best-fit parameters and their uncertainties in the fit to two-variable datasets. We have seen in previous chapters that the best-fit parameters and their uncertainties can be estimated analytically, for example, in the case of a linear regression with known errors in the dependent variable. In those cases, the exact analytical solution is typically the most straightforward to implement. When the analytic solution to a maximum likelihood fit is unavailable, then $\chi^2$ minimization followed by the $\chi^2_{\min} + \Delta\chi^2$ criterion can also be used to measure best-fit values and uncertainties in the parameters. Finally, Markov chain Monte Carlo methods to be presented in Chap. 16 can also be used in virtually any case for which the likelihood can be calculated.

The two methods presented in this section have a long history of use in statistical data analysis, and were in use well before Markov chain Monte Carlo methods became widespread. The bootstrap and jackknife methods are typically easier to implement than a Monte Carlo Markov chain. In particular, the bootstrap uses a large number of repetitions of the dataset, and therefore is computer intensive; the older jackknife method instead uses just a small number of additional resampled datasets, and requires fewer computing resources.


14.5.1  The  Bootstrap  Method 

Consider a dataset $Z$ composed of $N$ measurements of either a random variable or, more generally, a pair of variables. The bootstrap method consists of generating a large number of random, “synthetic” datasets based on the original set. Each set is then used to determine the distribution of the random variable (e.g., for the one-dimensional case) or of the best-fit parameters for the $y(x)$ model (for the two-dimensional case). The method has the following steps:

1. Draw at random $N$ datapoints from the original set $Z$, with replacement, to form a synthetic dataset $Z_i$. The new dataset therefore has the same dimension as the original set, but a few of the original points may be repeated, and a few missing.

2. For each dataset $Z_i$, calculate the parameter(s) of interest $a_i$. For example, the parameters can be calculated using a $\chi^2$ minimization technique.

3. Repeat this process as many times as possible, say $N_{boot}$ times.

4. At the end of the process, the parameters $a_n$, $n = 1, \ldots, N_{boot}$, approximate the posterior distribution of the parameter of interest. These values can therefore be used to construct the sample distribution function for the parameters, and therefore obtain the best-fit value and confidence intervals.

Notice that one advantage of the bootstrap method is that it can be used even in cases in which the errors on the datapoints are not available, which is a very common occurrence. In this situation, the direct maximum likelihood method applied to the original set $Z$ alone would not provide uncertainties in the best-fit parameters, as explained in Sect. 8.5. Since at each iteration the best-fit parameters alone must be evaluated, a dataset without errors in the dependent variable can still be fit to find the best-fit parameters, and the bootstrap method will provide an estimate of the uncertainties. This is one of the main reasons why the bootstrap method is so common.
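The steps of the bootstrap method can be sketched in Python for a linear model $y = a + bx$ fit to a dataset without error bars. Everything specific in this sketch is an assumption made for illustration: the data are synthetic (not Hubble's measurements), the fit is an unweighted least-squares regression, and the helper names `fit_line`, `bootstrap_fit`, and `central_interval` are hypothetical.

```python
import random
import statistics


def fit_line(points):
    """Unweighted least-squares fit of y = a + b x; returns (a, b) or None."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    if s_xx == 0.0:
        return None  # degenerate resample: all x values identical
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b


def bootstrap_fit(points, n_boot, rng):
    """Steps 1-3: resample with replacement and refit, n_boot times."""
    fits = []
    for _ in range(n_boot):
        resample = [rng.choice(points) for _ in points]  # same size as Z
        fit = fit_line(resample)
        if fit is not None:
            fits.append(fit)
    return fits


def central_interval(values, fraction=0.68):
    """Step 4: central range containing the given fraction of the values."""
    ordered = sorted(values)
    tail = (1.0 - fraction) / 2.0
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return low, high


rng = random.Random(6)  # fixed seed, for reproducibility only
# Synthetic data: y = 0.5 + 0.2 x with Gaussian scatter, no error bars.
data = [(float(x), 0.5 + 0.2 * x + rng.gauss(0.0, 0.05)) for x in range(10, 20)]
fits = bootstrap_fit(data, 2000, rng)
intercepts = [a for a, _ in fits]
slopes = [b for _, b in fits]
slope_low, slope_high = central_interval(slopes)
print(f"a: median = {statistics.median(intercepts):.3f}, "
      f"68% range = {central_interval(intercepts)}")
print(f"b: median = {statistics.median(slopes):.3f}, "
      f"68% range = ({slope_low:.3f}, {slope_high:.3f})")
```

Resampling the $(x, y)$ pairs together keeps each measurement intact. Resamples in which all $x$ values coincide cannot be fit and are skipped, which is a design choice of this sketch rather than part of the method as described.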

Example 14.2 (Bootstrap Analysis of Hubble’s Data) We perform a bootstrap analysis on the data from Hubble’s experiment of page 157. The dataset $Z$ consists of the ten measurements of the magnitude $m$ and logarithm of the velocity $\log v$, as shown in Fig. 8.2. We generate 10,000 random synthetic datasets of ten measurements each, for which typically a few of the original datapoints are repeated. Given that error bars on the dependent variable $\log v$ were not given, we assume that the uncertainties have a common value for all measurements (and therefore the value of the error is irrelevant for the determination of the best-fit parameters). For each dataset $Z_i$ we perform a linear regression to obtain the best-fit values of the parameters $a_i$ and $b_i$.

The sample distributions of the parameters are shown in Fig. 14.3; from them, we can take the median of the distribution as the “best-fit” value for the parameter, and the 68 % confidence interval as the central range of each parameter that contains 68 % of the parameter occurrences. It is clear that both distributions are somewhat asymmetric; the situation does not improve with a larger number of bootstrap samples, since there is only a finite number of synthetic datasets that can be generated at random, with replacement, from the original dataset (see Problem 14.1). ◇

Fig. 14.3 Monte Carlo bootstrap method applied to the data from Hubble’s experiment. (Left) Sample distribution of parameter $a$, with a median of $a = 0.54$ and a 68 % central range of 0.45-0.70. (Right) Distribution of $b$, with median $b = 0.197$ and a central range of 0.188-0.202. The best-fit values of the original dataset $Z$ were found to be $a = 0.55$ and $b = 0.197$ (see page 159)

A  key  feature  of  the  bootstrap  method  is  that  it  is  an  unbiased  estimator  for  the 
model  parameters.  We  can  easily  prove  this  general  property  in  the  special  case  of  a 
one-dimensional  dataset,  with  the  goal  of  estimating  the  sample  mean  and  variance 
of  the  random  variable  X  from  N  independent  measurements.  It  is  clear  that  we 
would  normally  not  use  the  bootstrap  method  in  this  situation,  since  (2.8)  and  (5.4) 
provide  the  exact  solution  to  the  problem.  The  following  proof  is  used  to  show  that 
the  bootstrap  method  provides  unbiased  estimates  for  the  mean  and  variance  of  a 
random  variable. 

Proof The sample average calculated for a given bootstrap dataset $Z_i$ is given by

$$\bar{x}_i = \frac{1}{N}\sum_{j=1}^{N} x_j n_{ji} \tag{14.12}$$

where $n_{ji}$ is the number of occurrences of datapoint $x_j$ in the synthetic set $Z_i$. If $n_{ji} = 0$ it means that $x_j$ was not selected for the set, $n_{ji} = 1$ means that there is just one occurrence of $x_j$ (as in the original set), and so on. The number $n_{ji} \le N$, and it is a random variable that is distributed like a binomial with $p = 1/N$, since the drawing for each bootstrap set is done at random, and with replacement. Therefore, we find that


$$\begin{cases} E[n_{ji}] = Np = 1 \\ \mathrm{Var}(n_{ji}) = \sigma_j^2 = Np(1 - p) = \dfrac{N-1}{N} \end{cases} \tag{14.13}$$

where the expectation is calculated for a given dataset $Z$, drawing a large number of bootstrap samples based on that specific set. It follows that $\bar{x}_i$ is an unbiased estimator of the sample mean,


$$E[\bar{x}_i] = \frac{1}{N}\sum_{j=1}^{N} x_j E[n_{ji}] = \bar{x}. \tag{14.14}$$

The  expectation  operator  used  in  the  equation  above  relates  to  the  way  in 
which  a  specific  synthetic  dataset  can  be  drawn,  i.e.,  indicates  an  “average” 
over  a  specific  dataset.  The  operation  of  expectation  should  also  be  repeated 
to  average  over  all  possible  datasets  Z  consisting  of  N  measurements  of  the 
random  variable  X,  and  that  operation  will  also  result  in  an  expectation  that  is 
equal  to  the  parent  mean  of  X , 


$$E[\bar{x}] = \mu. \tag{14.15}$$

Although we used the same symbol for the expectation of (14.14) and (14.15), the two operations are therefore different in nature.

The proof that the variance of the sample mean of dataset $Z_i$ is an unbiased estimator of the parent variance $\sigma^2/N$ is complicated by the fact that the random variables $n_{ji}$ are not independent. In fact, they are related by

$$\sum_{j=1}^{N} n_{ji} = N, \tag{14.16}$$

and this enforces a negative correlation between the variables that vanishes only in the limit of very large $N$. It can be shown that the covariance of the $n_{ji}$’s (say, the covariance between $n_{ji}$ and $n_{ki}$, where $j \neq k$, and $i$ labels the dataset) is given by

$$\sigma_{jk}^2 = -\frac{1}{N}. \tag{14.17}$$

The proof of (14.17) is left as an exercise, and it is based on the use of (14.16) and (4.3) (see Problem 14.2).

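The variance of $\bar{x}_i$ can be calculated using (4.3), since $\bar{x}_i$ is a linear combination of the $N$ random variables $n_{ji}$:

$$\mathrm{Var}(\bar{x}_i) = \mathrm{Var}\left(\frac{1}{N}\sum_{j=1}^{N} x_j n_{ji}\right) = \frac{1}{N^2}\sum_{j=1}^{N}\sum_{k=1}^{N} x_j x_k \sigma_{jk}^2 = \frac{1}{N^2}\left(\frac{N-1}{N}\sum_{j=1}^{N} x_j^2 - \frac{2}{N}\sum_{j=1}^{N}\sum_{k=j+1}^{N} x_j x_k\right),$$

in which we have used the results of (14.13) and (14.17), with $\sigma_{jj}^2 = \sigma_j^2$. Next, we need to calculate the expectation of this variance, in the sense of varying the dataset $Z$ itself: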


$$E[\mathrm{Var}(\bar{x}_i)] = \frac{N-1}{N^3}\sum_{j=1}^{N} E[x_j^2] - \frac{2}{N^3}\cdot\frac{1}{2}\sum_{j \neq k} E[x_j x_k]. \tag{14.18}$$

The last sum in the equation above is over all pairs $(j, k)$; the factor 1/2 takes into account the double-counting of terms such as $x_j x_k$ and $x_k x_j$, and the sum contains a total of $N(N - 1)$ identical terms. Since the measurements $x_j$, $x_k$ are independent and identically distributed, $E[x_j x_k] = E[x_j]^2$, it follows that

$$E[\mathrm{Var}(\bar{x}_i)] = \frac{N-1}{N^2}\left(E[x_j^2] - E[x_j]^2\right) = \frac{N-1}{N^2}\,\sigma^2 = \frac{N-1}{N}\,\sigma_\mu^2,$$

where $\sigma^2$ is the variance of the random variable $X$, and $\sigma_\mu^2 = \sigma^2/N$ the variance of the sample mean. The equation states that $E[\mathrm{Var}(\bar{x}_i)] = E[s^2]$, where $s^2$ is the sample variance of the mean, i.e., the sample variance of $X$ divided by $N$. We showed in Sect. 5.1.2 that the sample variance is an unbiased estimator of the variance of the mean, provided it is multiplied by the known factor $N/(N - 1)$. In practice, when calculating the variance from the $N$ bootstrap samples, we should use the factor $1/(N - 1)$ instead of $1/N$, as is normally done according to (5.6). □
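The moments of the occurrence numbers used in the proof, a mean of 1, a variance of $(N-1)/N$, and a covariance of $-1/N$ between two different datapoints, can be checked numerically. The Python sketch below is only a simulation, not a proof; the function name `occurrence_moments`, the choice $N = 5$, and the number of synthetic sets are assumptions.

```python
import random


def occurrence_moments(n_points, n_sets, rng):
    """Estimate mean, variance, and covariance of bootstrap occurrence counts.

    Tracks the counts of the first two datapoints (labels 0 and 1)
    over n_sets synthetic datasets; requires n_points >= 2.
    """
    sum_1 = sum_2 = sum_11 = sum_12 = 0.0
    for _ in range(n_sets):
        # One synthetic dataset: N labels drawn with replacement.
        draws = [rng.randrange(n_points) for _ in range(n_points)]
        n_1 = draws.count(0)
        n_2 = draws.count(1)
        sum_1 += n_1
        sum_2 += n_2
        sum_11 += n_1 * n_1
        sum_12 += n_1 * n_2
    mean_1 = sum_1 / n_sets
    mean_2 = sum_2 / n_sets
    var_1 = sum_11 / n_sets - mean_1 ** 2
    cov_12 = sum_12 / n_sets - mean_1 * mean_2
    return mean_1, var_1, cov_12


rng = random.Random(7)  # fixed seed, for reproducibility only
n_points = 5
mean_1, var_1, cov_12 = occurrence_moments(n_points, 50_000, rng)
print(f"mean = {mean_1:.3f} (expected 1)")
print(f"variance = {var_1:.3f} (expected {(n_points - 1) / n_points:.3f})")
print(f"covariance = {cov_12:.3f} (expected {-1 / n_points:.3f})")
```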


14.5.2  The  Jackknife  Method 

The  jackknife  method  is  an  older  Monte  Carlo  method  that  makes  use  of  just  N 
resampled  datasets  to  estimate  best- fit  parameters  and  their  uncertainties.  As  in  the 
bootstrap  method,  we  consider  a  dataset  Z  of  N  independent  measurements  either 
of  a  random  variable  X  or  of  a  pair  of  random  variables.  The  method  consists  of  the 
following  steps: 

1. Generate a resampled dataset $Z_j$ by deleting the $j$th element from the dataset. This resampled dataset therefore has dimension $N - 1$.

2. Each dataset $Z_j$ is used to estimate the parameters of interest. For example, apply the linear regression method to dataset $Z_j$ and find the best-fit values of the linear model, $a_j$ and $b_j$.

3. The parameters of interest are also calculated from the full-dimensional dataset $Z$, as one normally would. The best-fit parameters are called $\hat{a}$.

4. For each dataset $Z_j$, define the pseudo-values $a_j^*$ as

$$a_j^* = N\hat{a} - (N - 1)a_j \tag{14.19}$$

5. The jackknife estimate of each parameter of interest and its uncertainty are given by the following equations:

$$\begin{cases} a^* = \dfrac{1}{N}\sum_{j=1}^{N} a_j^* \\ \sigma_{a^*}^2 = \dfrac{1}{N(N-1)}\sum_{j=1}^{N} (a_j^* - a^*)^2. \end{cases} \tag{14.20}$$
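The five steps can be sketched in Python for a generic estimator. The function name `jackknife` and the sample values are hypothetical; the estimator used here is the sample mean, for which the jackknife estimate coincides with the mean itself and its uncertainty with the usual standard error of the mean.

```python
import math
import statistics


def jackknife(data, estimator):
    """Return the jackknife estimate a* and its uncertainty sigma_a*."""
    n = len(data)
    a_full = estimator(data)                  # step 3: full dataset
    pseudo_values = []
    for j in range(n):
        reduced = data[:j] + data[j + 1:]     # step 1: delete the jth element
        a_j = estimator(reduced)              # step 2: estimate on Z_j
        pseudo_values.append(n * a_full - (n - 1) * a_j)  # step 4, (14.19)
    a_star = sum(pseudo_values) / n           # step 5, (14.20)
    variance = sum((p - a_star) ** 2 for p in pseudo_values) / (n * (n - 1))
    return a_star, math.sqrt(variance)


measurements = [4.2, 5.1, 3.9, 4.8, 5.5, 4.4]  # hypothetical values
a_star, sigma_a_star = jackknife(measurements, statistics.mean)
print(f"jackknife estimate = {a_star:.4f} +/- {sigma_a_star:.4f}")
```

Passing a different `estimator`, for example a function that returns the slope of a linear regression, applies the same procedure to a two-variable dataset stored as a list of pairs.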


To prove that (14.20) provides an accurate estimate for the parameters and their errors, we apply it to the simple case of the estimate of the mean from a sample of $N$ measurements. In this case we want to show that the expectation of the jackknife estimate of the mean $a^*$ is equal to the parent mean $\mu$, and that the expectation of its variance $\sigma_{a^*}^2$ is equal to $\sigma^2/N$.

Proof For a sample of $N$ measurements of a random variable $x$, the sample mean and its variance are given by

$$\begin{cases} \bar{x} = \dfrac{1}{N}\sum_{i=1}^{N} x_i \\ \dfrac{s^2}{N} = \dfrac{1}{N(N-1)}\sum_{i=1}^{N} (x_i - \bar{x})^2. \end{cases} \tag{14.21}$$


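The proof consists of showing that $a_j^* = x_j$, so that $a^*$ is the sample mean and $\sigma_{a^*}^2$ is its variance, as in (14.21). The result follows from the estimates for the resampled and full datasets:

$$a_j = \frac{1}{N-1}\sum_{i \neq j} x_i, \qquad \hat{a} = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\Rightarrow\quad a_j^* = N\hat{a} - (N-1)a_j = \sum_{i=1}^{N} x_i - \sum_{i \neq j} x_i = x_j.$$

Notice that the factor of $1/(N - 1)$ was used in the calculation of the sample variance, according to (5.6). □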

Example 14.3 In the case of the Hubble experiment of page 157, we can use the jackknife method to estimate the best-fit parameters of the fit to a linear model of $m$ versus $\log v$. According to (14.20), we find that $a^* = 0.52$, $\sigma_{a^*} = 0.13$, and $b^* = 0.199$, $\sigma_{b^*} = 0.008$. These estimates are in very good agreement with the results of the bootstrap method, and those of the direct fit to the original dataset for which, however, we could not provide uncertainties in the fit parameters. ◇


Summary  of  Key  Concepts  for  this  Chapter 

□  Monte  Carlo  method :  Any  numerical  method  that  makes  use  of  random 
variables  to  perform  calculations  that  are  too  complex  to  be  performed 
analytically,  such  as  Monte  Carlo  integration  and  “dart”  methods. 

□  Bootstrap  method :  A  common  method  to  estimate  model  parameters  that 
uses  a  large  number  of  synthetic  datasets  obtained  by  re-sampling  of  the 
original  data. 

□ Jackknife method: A simple method to estimate model parameters that uses just $N$ resampled datasets.

Problems

14.1  Calculate  how  many  synthetic  bootstrap  datasets  can  be  generated  at  random 
from  a  dataset  Z  with  N  unique  datapoints.  Notice  that  the  order  in  which  the 
datapoints  appear  in  the  dataset  is  irrelevant. 

14.2 For a bootstrap dataset $Z_i$ constructed from a set $Z$ of $N$ independent measurements of a variable $X$, show that the covariance between the numbers of occurrences $n_{ji}$ and $n_{ki}$ is given by (14.17),

$$\sigma_{jk}^2 = -\frac{1}{N}.$$

14.3 Perform a numerical simulation of the number $\pi$, and determine how many samples are sufficient to achieve a precision of 0.1 %. The first six significant digits of the number are $\pi = 3.14159$.

14.4  Perform  a  bootstrap  simulation  on  the  Hubble  data  presented  in  Fig.  14.3,  and 
find  the  68  %  central  confidence  ranges  on  the  parameters  a  and  b. 

14.5 Using the data of Problem 8.2, run a bootstrap simulation with $N = 1000$ iterations for the fit to a linear model. After completion of the simulation, plot the sample probability distribution function of the parameters $a$ and $b$, and find the median and 68 % confidence intervals on the fit parameters. Describe the possible reason why the distributions of the fit parameters are not symmetric.

14.6 Use the data of Problem 8.2, assuming that the errors in the dependent variable $y$ are unknown. Run a bootstrap simulation with $N = 1000$ iterations, and determine the median and 68 % confidence intervals on the parameters $a$ and $b$ of the fit to a linear model.

14.7  Using  the  data  of  Problem  8.2,  assuming  that  the  errors  in  the  dependent 
variable  y  are  unknown,  estimate  the  values  of  a  and  b  to  the  fit  to  a  linear  model 
using  a  jackknife  method. 

14.8 Given two uniform random variables $U_1$ and $U_2$ between $-R$ and $+R$, as
often available in common programming software, provide an analytic expression
to simulate a Gaussian variable of mean $\mu$ and variance $\sigma^2$.


Chapter  15 

Introduction  to  Markov  Chains 


Abstract The theory of Markov chains is rooted in the work of Russian
mathematician Andrey Markov and has an extensive body of literature to establish its
mathematical foundations. The availability of computing resources has recently
made it possible to use Markov chains to analyze a variety of scientific data. Monte
Carlo Markov chains are now one of the most popular methods of data analysis.
This chapter presents the key mathematical properties of Markov chains, necessary
to understand their implementation as Monte Carlo Markov chains.


15.1  Stochastic  Processes  and  Markov  Chains 

This section presents key mathematical properties of Markov chains. The treatment
is somewhat theoretical, but it is necessary to ensure that the applications we make to the
analysis of data are consistent with the mathematics of Markov chains, which can
be very complex. The goal is therefore to establish a basic
set of definitions and properties necessary to use Markov chains for the analysis of
data, especially via Monte Carlo simulations.

Markov chains are a specific type of stochastic process, or sequence of random
variables. A typical example of a Markov chain is the so-called random walk, in which
at each time step a person randomly takes a step either to the left or to the right. As
time progresses, the location of the person is the random variable of interest, and the
collection of such random variables forms a Markov chain. The ultimate goal of a
Markov chain is to determine the stationary distribution of the random variable. For
the random walk, we are interested in knowing the probability that at a given
time the person is located $n$ steps to the right or to the left of the starting point.

In  the  typical  case  of  interest  for  the  analysis  of  data,  a  dataset  Z  is  fit  to  a 
parametric  model.  The  goal  is  to  create  a  Markov  chain  for  each  parameter  of 
the  model,  in  such  a  way  that  the  stationary  distribution  for  each  parameter  is 
the  distribution  function  of  the  parameter.  The  chain  will  therefore  result  in  the 
knowledge  of  the  best-fit  value  of  the  parameter,  and  of  confidence  intervals,  making 
use  of  the  information  provided  by  the  dataset. 


© Springer Science+Business Media New York 2017

M. Bonamente, Statistics and Analysis of Scientific Data, Graduate Texts in Physics, DOI 10.1007/978-1-4939-6572-4_15




15.2  Mathematical  Properties  of  Markov  Chains 

A stochastic process is defined as a sequence of variables $X_t$,

$$\{X_t, \text{ for } t \in T\} \tag{15.1}$$

where $t$ labels the sequence. The domain for the index $t$ is indicated as $T$ to signify
“time.” The domain is usually a subset of the real numbers ($T \subset \mathbb{R}$) or of the natural
numbers ($T \subset \mathbb{N}$). As time progresses, the random variables $X_t$ change value, and
the stochastic process describes this evolution.

A  Markov  chain  is  a  particular  stochastic  process  that  satisfies  the  following 
properties: 

1. The time domain is the natural numbers ($T \subset \mathbb{N}$), and each random variable
$X_t$ can have values in a countable set, e.g., the natural numbers or even an
$n$-dimensional space ($\mathbb{N}^n$), but not real numbers ($\mathbb{R}^n$). A typical example of a
Markov chain is one in which $X_i = n$, where both $i$ (the time index) and $n$ (the
value of the random variable) are natural numbers. Therefore a Markov chain
takes the form of

$$X_1 \to X_2 \to X_3 \to \ldots \to X_n \to \ldots$$

The random variable $X_i$ describes the state of the system at time $t = i$. The
fact that Markov chains must be defined by way of countable sets may appear
an insurmountable restriction, since it would appear that the natural domain for
an $n$-parameter space is $\mathbb{R}^n$. While a formal extension of Markov chains to $\mathbb{R}^n$
is also possible, the restriction to countable sets is not a complication for any practical
application, since any parameter space can be “binned” into a finite number of states.
For example, the position of the person in a random walk was “binned” into a
finite (or infinite but countable) number of positions, and a similar process can
be applied to virtually any parameter of interest for a given model. This means
that the variable under consideration can occupy one of a countable multitude
of states $\epsilon_1, \epsilon_2, \ldots, \epsilon_n, \ldots$, and the random variable $X_i$ identifies the state of the
system at time step $i$, $X_i = \epsilon_n$.

2. A far more important property that makes a stochastic process a Markov chain is
the fact that subsequent steps in the chain depend only on the current state
of the chain, and not on any of its previous history. This “short memory” property
is known as the Markovian property, and it is the key to the construction of
Markov chains for the purpose of data analysis. In mathematical terms, given
the present time $t = n$, the future state of the chain at $t = n + 1$ ($X_{n+1}$)
depends only on the present state ($X_n$), but not on past history. Much of the effort
in the construction of a Monte Carlo Markov chain lies in the identification
of a transition probability from state $\epsilon_i$ to state $\epsilon_j$ between consecutive time
steps,

$$p_{ij} = P(X_{n+1} = \epsilon_j / X_n = \epsilon_i). \tag{15.2}$$

A Markov chain requires that this probability be time-independent, and therefore
a Markov chain has the property of time homogeneity. In Chap. 16 we will see
how the transition probability takes into account the likelihood of the data $Z$ with
the model.

The two properties described above result in the fact that a Markov chain is a
sequence of states determined by transition probabilities $p_{ij}$ (also referred to as
the transition kernel) that are fixed in time. The ultimate goal is to determine the
probability to find the system in each of the allowed states. With an eye towards
future applications for the analysis of data, each state may represent values of one
or many parameters, and therefore a Markov chain makes it possible to reconstruct
the probability distribution of the parameters.

Example 15.1 (Random Walk) The random walk is a Markov chain that represents
the location of a person who randomly takes a step of unit length forward with
probability $p$, or a step backward with probability $q = 1 - p$ (typically $p = q = 1/2$).
The state of the system is defined by the location $i$ at which the person is
found at time $t = n$,

$$X_n = \{\text{Location } i \text{ along the } \mathbb{N}^{\pm} \text{ axis}\}$$

where $\mathbb{N}^{\pm}$ indicates all positive and negative integers. For this chain, the time
domain is the set of positive integers ($T = \mathbb{N}$), and the position can be any negative
or positive integer ($\mathbb{N}^{\pm}$). The transition probability describes the fact that the person
can only take either a step forward or backward:

$$p_{ij} = \begin{cases} p & \text{if } j = i + 1 \text{, or move forward} \\ q & \text{if } j = i - 1 \text{, or move backward} \\ 0 & \text{otherwise.} \end{cases} \tag{15.3}$$

The chain satisfies the Markovian property, since the transition probability depends
only on its present position, and not on previous history. ◊
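As an illustration of the transition probabilities (15.3), the following Python sketch simulates one realization of the random walk. The function name `random_walk`, the seed, and the number of steps are choices made here for illustration and are not part of the text.

```python
import random


def random_walk(n_steps, p=0.5, start=0, seed=None):
    """Simulate a one-dimensional random walk; return the list of positions."""
    rng = random.Random(seed)
    positions = [start]
    for _ in range(n_steps):
        # The next position depends only on the current one (Markovian property):
        # step forward with probability p, backward with probability q = 1 - p.
        step = 1 if rng.random() < p else -1
        positions.append(positions[-1] + step)
    return positions


# One realization of 1000 steps with p = q = 1/2 (illustrative values).
path = random_walk(1000, p=0.5, seed=1)
```

Since every step changes the position by exactly one unit, the position at time $n$ has the same parity as $n$; a return to the starting point is therefore possible only after an even number of steps.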

Example 15.2 Another case of a Markov chain is a simple model of diffusion,
known as the Ehrenfest chain. Consider two boxes with a total of $m$ balls. At each
time step, one selects a ball at random, from either box, and moves it to the other
box. The state of the system can be defined via the random variable

$$X_n = \{\text{Number of balls in the first box}\}.$$

The random variable can have only a finite number of values $(0, 1, \ldots, m)$. At each
time step, the transition probability is

$$p_{ij} = \begin{cases} \dfrac{m-i}{m} & \text{if } j = i + 1 \text{ (box had } i \text{ balls, now has } i + 1) \\ \dfrac{i}{m} & \text{if } j = i - 1 \text{ (box had } i \text{ balls, now has } i - 1). \end{cases} \tag{15.4}$$

For example, in the first case it means that we chose one of the $m - i$ balls from the
second box. The transition probabilities depend only on the number of balls in the
first box at any given time, and are completely independent of how the box came to
have that many balls. This chain therefore satisfies the Markovian property. ◊

15.3 Recurrent and Transient States

We are interested in knowing how often a state is visited by the chain and, in
particular, whether a given state can be visited infinitely often. Assume that the
system is initially in state $\epsilon_i$. We define $u_k$ as the probability that the system returns to
the initial state for the first time in exactly $k$ time steps, and $v_n$ as the probability that
the system returns to the initial state at time $n$, with the possibility that it may have
returned there other times prior to $n$. Clearly, it is true that $v_n \geq u_n$.

To determine whether a state is recurrent or transient, we define

$$u = \sum_{n=1}^{\infty} u_n \tag{15.5}$$

as the probability of the system returning to the initial state $\epsilon_i$ for the first time at
some time $n$. The state can be classified as recurrent or transient according to the
probability of returning to that state:

$$\begin{cases} u = 1 & \text{state is recurrent;} \\ u < 1 & \text{state is transient.} \end{cases} \tag{15.6}$$


Therefore  a  recurrent  state  is  one  that  will  certainly  be  visited  again  by  the  chain. 
Notice  that  no  indication  is  given  as  to  the  time  at  which  the  system  will  return  to 
the  initial  state. 

We also state a few theorems that are relevant to the understanding of recurrent
states. Proofs of these theorems can be found, for example, in the textbook by Ross
[38] or other books on stochastic processes, and are not reported here.

Theorem 15.1 With $v_n$ the probability that the system returns to a state $\epsilon_i$ at time $n$,

$$\text{state } \epsilon_i \text{ is recurrent} \iff \sum_{n=1}^{\infty} v_n = \infty. \tag{15.7}$$


This theorem states that, if the system is certain to return to a given state, then it will
do so infinitely often. Also, since this is a necessary and sufficient condition, any
transient state will not be visited by the chain an infinite number of times. This
means that transient states will no longer be visited after a given time, i.e., they
are only visited during an initial period. The fact that recurrent states are visited
infinitely often means that it is possible to construct a sample distribution function
for recurrent states with a precision that is a function of the length of the chain. No
information is, however, provided on the timing of the return to a recurrent state.

We also introduce the definition of accessible states: a state $\epsilon_j$ is said to be
accessible from state $\epsilon_i$ if $p_{ij}(m) > 0$ for some natural number $m$, meaning that
there is a non-zero probability of reaching this state from another state in $m$ time
steps. The following theorems establish properties of accessible states, and how the
property of accessibility relates to that of recurrence.

Theorem 15.2 If a state $\epsilon_j$ is accessible from a recurrent state $\epsilon_i$, then $\epsilon_j$ is also
recurrent, and $\epsilon_i$ is accessible from $\epsilon_j$.

This theorem states that once the system reaches a recurrent state, the states
that can be reached from it must also be recurrent, and therefore will be visited
again infinitely often. This means that recurrent states form a network, or class, of
states that share the property of recurrence, and these are the states that the chain
will sample over and over again as a function of time.

Theorem 15.3 If a Markov chain has a finite number of states, and each state is
accessible from any other state, then all states are recurrent.

This theorem ensures that all states in such a finite chain will be visited infinitely often,
and therefore the chain will sample all states as a function of time. This property is
of special relevance for Monte Carlo Markov chain methods in which the states of
the chain are possible values of the parameters. As the chain progresses, all values
of the parameters are accessible, and will be visited in proportion to the posterior
distribution of the parameters.

Example 15.3 (Recurrence of States of the Random Walk) Consider the random
walk with transition probabilities given by (15.3). We want to determine whether
the initial state of the chain is a recurrent or a transient state for the chain.
The probability of returning to the initial state in $k$ steps is clearly given by the
binomial distribution,

$$p_{ii}(k) = \begin{cases} 0 & \text{if } k \text{ is odd} \\ C(n,k)\, p^n q^n & \text{if } k = 2n \text{ is even} \end{cases} \tag{15.8}$$

where

$$C(n,k) = \frac{k!}{(k-n)!\, n!} \tag{15.9}$$

is the number of combinations [of $n$ successes out of $k = 2n$ tries, see (3.3)]. Using
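Stirling’s approximation for the factorial function in the binomial coefficient,

$$n! \simeq \sqrt{2 \pi n}\, n^n e^{-n},$$

the probability to return at time $k = 2n$ to the initial state becomes

$$p_{ii}(k) = \frac{(2n)!}{n!\, n!}\, p^n q^n \simeq \frac{\sqrt{4 \pi n}\, (2n)^{2n} e^{-2n}}{2 \pi n\, n^{2n} e^{-2n}}\, p^n q^n = \frac{(4pq)^n}{\sqrt{\pi n}}$$

which holds only for $k$ even.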

This equation can be used in conjunction with Theorem 15.1 to see if the initial
state is transient or recurrent. Consider the series

$$\sum_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}} (4pq)^n.$$

According to Theorem 15.1, the divergence of this series is a necessary and
sufficient condition to prove that the initial state is recurrent.

(a) $p \neq q$. In this case, $x = 4pq < 1$ and

$$\sum_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}} (4pq)^n < \sum_{n=1}^{\infty} x^n = \frac{x}{1-x};$$

since $x < 1$, the series converges and therefore the state is transient. This means
that the system may return to the initial state, but only for a finite number of
times, even after an infinite time. Notice that as time progresses the state of the
system will drift in the direction that has a probability $> 1/2$.

(b) $p = q = 1/2$, thus $4pq = 1$. The series becomes

$$\frac{1}{\sqrt{\pi}} \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}}. \tag{15.10}$$

It can be shown (see Problem 15.2) that this series diverges, and therefore a random
walk with the same probability of taking a step to the left or to the right will return
to the origin infinitely often. ◊
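The behavior of the series in the two cases can be explored numerically. The sketch below is only an illustration and not a proof: it compares the exact return probability with the Stirling estimate $(4pq)^n/\sqrt{\pi n}$, and evaluates partial sums of the series. The function names and the values of $p$ and of the truncation are assumptions made here.

```python
import math


def return_probability(n, p):
    """Exact probability of being at the origin after k = 2n steps."""
    q = 1.0 - p
    return math.comb(2 * n, n) * (p * q) ** n


def stirling_estimate(n, p):
    """Approximation (4pq)^n / sqrt(pi n) obtained with Stirling's formula."""
    q = 1.0 - p
    return (4.0 * p * q) ** n / math.sqrt(math.pi * n)


def partial_sum(n_max, p):
    """Sum of the Stirling estimates for n = 1, ..., n_max."""
    return sum(stirling_estimate(n, p) for n in range(1, n_max + 1))


# Asymmetric walk (p = 0.6): the partial sums stop growing.
asymmetric = [partial_sum(n_max, 0.6) for n_max in (100, 1000, 10000)]

# Symmetric walk (p = 0.5): the partial sums keep growing with n_max.
symmetric = [partial_sum(n_max, 0.5) for n_max in (100, 1000, 10000)]
```

For $p = 0.6$ the partial sums remain below the bound $x/(1-x) = 24$ with $x = 0.96$, while for $p = 1/2$ they grow approximately as $\sqrt{n}$.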
15.4 Limiting Probabilities and Stationary Distribution

The ultimate goal of a Markov chain is to calculate the probability that a system
occupies a given state $\epsilon_j$ after a large number $n$ of steps. This probability is called
the limiting probability. According to the frequentist approach defined in (1.2), it is
given by

$$p_j = \lim_{n \to \infty} p_j(n), \tag{15.11}$$

where $p_j(n)$ is the probability of the system to be found in state $\epsilon_j$ at time $t = n$.
With the aid of the total probability theorem, the probability of the system to be in
state $\epsilon_j$ at time $t = n$ is

$$p_j(n) = \sum_k p_k(n-1)\, p_{kj}. \tag{15.12}$$

In fact $p_k(n-1)$ represents the probability of being in state $\epsilon_k$ at time $n - 1$, and
the states $\epsilon_k$ occupied at time $n - 1$ form a set of mutually exclusive events which
encompasses all possible outcomes, with the index $k$ running over all possible states.
This formula can be used to calculate recursively the probability $p_j(n)$ using the
probability at the previous step and the transition probabilities $p_{kj}$, which do not
vary with time.
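The recursion (15.12) is straightforward to apply numerically. The following sketch propagates the probabilities $p_j(n)$ for a small three-state chain; the transition matrix `P` and the initial condition are hypothetical values chosen here for illustration, and are not taken from the text.

```python
def step(dist, P):
    """One application of the recursion p_j(n) = sum_k p_k(n-1) p_kj."""
    n_states = len(P)
    return [sum(dist[k] * P[k][j] for k in range(n_states))
            for j in range(n_states)]


# Hypothetical three-state chain; each row of P sums to one.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

# The system starts with certainty in the first state.
dist = [1.0, 0.0, 0.0]
for n in range(50):
    dist = step(dist, P)
```

For this particular matrix the probabilities settle to values close to (0.25, 0.5, 0.25) after a few tens of steps, and further iterations leave them essentially unchanged.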

Equation (15.12) can be written in a different form if the system is known to
be in state $\epsilon_i$ at an initial time $t = 0$:

$$p_{ij}(n) = P(X_n = \epsilon_j) = \sum_k p_{ik}(n-1)\, p_{kj} \tag{15.13}$$

where $p_{ij}(n)$ is the probability of the system going from state $\epsilon_i$ to $\epsilon_j$ in $n$ time
steps.

The probabilities $p_j(n)$ and $p_{ij}(n)$ change as the chain progresses. The
limiting probabilities $p_j$, on the other hand, are independent of time, and
they form the stationary distribution of the chain. General properties for
the stationary distribution can be given for Markov chains that have certain
specific properties. In the following we introduce additional definitions that
are useful to characterize Markov chains, and to determine the stationary
distribution of the chain.

A number of states that are accessible from each other, meaning there is
a non-zero probability to reach one state from the other ($p_{ij}(m) > 0$ for some $m$),
are said to communicate, and all states that communicate are part of the same class.
The property of communication ($\leftrightarrow$) is an equivalence relation, meaning that
it obeys the following three properties:

(a) The reflexive property: $i \leftrightarrow i$;

(b) The symmetric property: if $i \leftrightarrow j$, then $j \leftrightarrow i$; and

(c) The transitive property: if $i \leftrightarrow j$ and $j \leftrightarrow k$, then $i \leftrightarrow k$.

Therefore, each class is separate from any other class of the same chain. A chain
is said to be irreducible if it has only one class, and thus all states communicate
with each other.

Another property of Markov chains is periodicity. A state is said to be
periodic with period $T$ if $p_{ii}(n) = 0$ when $n$ is not divisible by $T$, and $T$ is the
largest such integer with this property. This means that the return to a given
state must occur in multiples of $T$ time steps. A chain is said to be aperiodic if
$T = 1$, and return to a given state can occur at any time. It can be shown that
all states in a class share the same period.
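The definition of period can be checked numerically, since the largest $T$ with this property is the greatest common divisor of the times $n$ at which a return is possible. The sketch below estimates the period by inspecting $p_{ii}(n)$ only up to a finite `n_max`, which is an assumption of this example; the two chains (an Ehrenfest chain with $m = 4$ balls, and a hypothetical three-state chain in which the system may remain in the same state) are illustrative choices.

```python
from math import gcd


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def period(P, i, n_max=50):
    """gcd of the times n <= n_max with p_ii(n) > 0 (0 if no return is found)."""
    d = 0
    Pn = P  # Pn holds the n-step transition probabilities, starting with n = 1
    for n in range(1, n_max + 1):
        if Pn[i][i] > 0:
            d = gcd(d, n)
        Pn = mat_mul(Pn, P)
    return d


# Ehrenfest chain of (15.4) with m = 4 balls.
m = 4
ehrenfest = [[0.0] * (m + 1) for _ in range(m + 1)]
for i in range(m + 1):
    if i < m:
        ehrenfest[i][i + 1] = (m - i) / m
    if i > 0:
        ehrenfest[i][i - 1] = i / m

# Hypothetical chain in which the system can remain in the same state.
lazy = [[0.50, 0.50, 0.00],
        [0.25, 0.50, 0.25],
        [0.00, 0.50, 0.50]]

period_ehrenfest = period(ehrenfest, 0)
period_lazy = period(lazy, 0)
```

In the Ehrenfest chain the number of balls in the first box changes by one at every step, so a state can be revisited only after an even number of steps; the second chain can return to a state at any time.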

The  uniqueness  of  the  stationary  distribution  and  an  equation  that  can  be 
used  to  determine  it  are  established  by  the  following  theorems. 

Theorem 15.4 An irreducible aperiodic Markov chain belongs to either of
the following two classes:

1. All states are positive recurrent. In this case, the limiting probabilities $p_i$ are the
stationary distribution, and this distribution is unique.

2. All states are transient or null recurrent; in this case, there is no stationary
distribution.

This theorem establishes that a “well behaved” Markov chain, i.e., one with
positive recurrent states, does have a stationary distribution, and that this
distribution is unique. Positive recurrent states, defined in Sect. 15.3, are those
for which the expected time to return to the same state is finite, while the expected
time to return to a transient or null recurrent state is infinite. This theorem also
ensures that, regardless of the starting point of the chain, the same stationary
distribution will eventually be reached.

Theorem 15.5 The limiting probabilities are the solution of the system of linear
equations

$$p_j = \sum_{i=1}^{N} p_i\, p_{ij}. \tag{15.14}$$

Proof According to the recursion formula (15.12),

$$p_j(n) = \sum_{i=1}^{N} p_i(n-1)\, p_{ij}. \tag{15.15}$$

Therefore the result follows by taking the limit $n \to \infty$ of the above equation. □

If we consider a chain with a probability distribution at a time $t_0$ that
satisfies

$$p_j(t_0) = \sum_{i=1}^{N} p_i(t_0)\, p_{ij} \tag{15.16}$$

then the probability distribution of the states satisfies (15.14), and the chain
has reached the stationary distribution $p_j(n) = p_j$. Theorem 15.5 guarantees
that, from that point on, the chain will maintain its stationary distribution.
The importance of a stationary distribution is that, as time elapses, the
chain samples this distribution. The sample distribution of the chain, e.g., a
histogram plot of the occurrence of each state, can therefore be used as an
approximation of the posterior distribution.
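Both statements can be illustrated with a short simulation. The sketch below checks condition (15.14) for a candidate distribution, and compares it with the fraction of time that a simulated chain spends in each state. The three-state transition matrix, the chain length, and the seed are hypothetical choices made for this example.

```python
import random
from collections import Counter

# Hypothetical three-state chain; each row of P sums to one.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]


def is_stationary(p, P, tol=1e-12):
    """Check the condition p_j = sum_i p_i p_ij for every state j."""
    n = len(P)
    return all(abs(sum(p[i] * P[i][j] for i in range(n)) - p[j]) < tol
               for j in range(n))


def simulate(P, n_steps, start=0, seed=0):
    """Run the chain and return the fraction of steps spent in each state."""
    rng = random.Random(seed)
    states = range(len(P))
    state = start
    counts = Counter()
    for _ in range(n_steps):
        # Draw the next state from the row of P for the current state.
        state = rng.choices(states, weights=P[state])[0]
        counts[state] += 1
    return [counts[s] / n_steps for s in states]


candidate = [0.25, 0.50, 0.25]
freq = simulate(P, 200_000)
```

Here `candidate` satisfies (15.14), and the sample frequencies `freq` approach it as the chain becomes longer, up to statistical fluctuations.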

Example 15.4 (Stationary Distribution of the Ehrenfest Chain) We want to find a
distribution function $p_j$ that is the stationary distribution of the Ehrenfest chain. This
case is of interest because the finite number of states makes the calculation of the
stationary distribution easier to achieve analytically. The condition for a stationary
distribution is

$$p_j = \sum_{i=1}^{N} p_i\, p_{ij}$$

where $N$ is the number of states of the chain. The condition can also be written in
matrix notation. Recall that the transition probabilities for the Ehrenfest chain are

$$p_{ij} = \begin{cases} \dfrac{m-i}{m} & \text{if } j = i + 1 \\ \dfrac{i}{m} & \text{if } j = i - 1 \end{cases}$$

and they can be written as a transition matrix $P$

$$P = [p_{ij}] = \begin{bmatrix} 0 & 1 & 0 & 0 & \ldots & 0 & 0 \\ \frac{1}{m} & 0 & \frac{m-1}{m} & 0 & \ldots & 0 & 0 \\ 0 & \frac{2}{m} & 0 & \frac{m-2}{m} & \ldots & 0 & 0 \\ \vdots & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \ldots & 1 & 0 \end{bmatrix} \tag{15.17}$$

Notice that the sum of each row is one, since

$$\sum_j p_{ij} = 1$$

is the probability of going from state $\epsilon_i$ to any state $\epsilon_j$. In (15.17) you can regard the
vertical index to be $i = 0, \ldots, m$, and the horizontal index $j = 0, \ldots, m$.

The way in which we typically use (15.14) is simply to verify whether a
distribution is the stationary distribution of the chain. In the case of the Ehrenfest
chain, we try the binomial distribution as the stationary distribution,

$$p_i = \binom{m}{i} p^i q^{m-i},$$

in which $p$ and $q$ represent the probability of finding a ball in either box. At
equilibrium one expects $p = q = 1/2$, since even an initially uneven distribution of
balls between the two boxes should result in an even distribution at later times. It is
therefore reasonable to expect that the probability of having $i$ balls in the first box,
out of a total of $m$, is equivalent to that of $i$ positive outcomes in a binary experiment.

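To prove this hypothesis, consider $p = [p_0, p_1, \ldots, p_m]$ as a row vector of
dimension $m + 1$, and verify the equation

$$p = pP, \tag{15.18}$$

which is the matrix notation for the condition of a stationary distribution. For the
Ehrenfest chain, this condition is

$$[p_0, p_1, \ldots, p_m] = [p_0, p_1, \ldots, p_m] \begin{bmatrix} 0 & 1 & 0 & 0 & \ldots & 0 & 0 \\ \frac{1}{m} & 0 & \frac{m-1}{m} & 0 & \ldots & 0 & 0 \\ 0 & \frac{2}{m} & 0 & \frac{m-2}{m} & \ldots & 0 & 0 \\ \vdots & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \ldots & 1 & 0 \end{bmatrix}.$$
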

For a given state $i$, only two terms (at most) contribute to the sum,

$$p_i = p_{i-1}\, p_{i-1,i} + p_{i+1}\, p_{i+1,i}. \tag{15.19}$$

From this we can prove that the $p = q = 1/2$ binomial is the stationary distribution
of the Ehrenfest chain (see Problem 15.1). ◊
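The condition $p = pP$ can also be checked numerically for a specific number of balls. The sketch below is a check for one value of $m$, not a proof; it builds the transition matrix (15.17) with exact rational arithmetic and applies it to the binomial distribution with $p = q = 1/2$. The helper names and the value $m = 10$ are assumptions made here for illustration.

```python
from fractions import Fraction
from math import comb


def ehrenfest_matrix(m):
    """Transition matrix of the Ehrenfest chain, indices i, j = 0, ..., m."""
    P = [[Fraction(0)] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        if i < m:
            P[i][i + 1] = Fraction(m - i, m)  # a ball enters the first box
        if i > 0:
            P[i][i - 1] = Fraction(i, m)      # a ball leaves the first box
    return P


def binomial_half(m):
    """Binomial distribution with p = q = 1/2 for i = 0, ..., m."""
    return [Fraction(comb(m, i), 2 ** m) for i in range(m + 1)]


def apply(p, P):
    """Row vector times matrix: (pP)_j = sum_i p_i p_ij."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


m = 10
P = ehrenfest_matrix(m)
p = binomial_half(m)
unchanged = apply(p, P) == p  # exact comparison, no rounding involved
```

The use of `Fraction` avoids rounding errors, so the equality between `p` and `pP` can be tested exactly rather than within a tolerance.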
Summary of Key Concepts for this Chapter

□ Markov chain: A stochastic process or sequence of random variables as a
function of an integer time variable.

□  Markovian  property :  It  is  the  key  property  of  Markov  chains,  stating  that 
the  state  of  the  system  at  a  given  time  depends  only  on  the  state  at  the 
previous  time  step. 

□  Recurrent  and  transient  state :  A  recurrent  state  occurs  infinitely  often 
while  a  transient  state  only  occurs  a  finite  number  of  times  in  the  Markov 
chain. 

□  Stationary  distribution :  It  is  the  asymptotic  distribution  of  each  variable, 
obtained  after  a  large  number  of  time  steps  of  the  Markov  chain.  When 
the  variable  represents  a  model  parameter,  the  stationary  distribution  is  the 
posterior  distribution  of  the  parameter. 


Problems 

15.1 Consider the Markov chain for the Ehrenfest chain described in Example 15.4.
Show that the stationary distribution is the binomial with $p = q = 1/2$.

15.2 Show that the random walk with $p = q = 1/2$ (15.10) returns to the origin
infinitely often, and therefore the origin is a recurrent state of the chain.

15.3 For the random walk with $p \neq q$, show that the origin is a transient state.

15.4 Assume that the diffusion model of Example 15.2 is modified in such a way
that at each time step one has the option to choose one box at random from which
to move a ball to the other box.

(a) Determine the transition probabilities $p_{ij}$ for this process.

(b) Determine whether this process is a Markov chain.

15.5 Using the model of diffusion of Problem 15.4, determine if the binomial
distribution with $p = q = 1/2$ is the stationary distribution.


Chapter  16 

Monte  Carlo  Markov  Chains 


Abstract Monte Carlo Markov Chains (MCMC) are a powerful method to analyze
scientific data that has become popular with the availability of modern-day
computing resources. The basic idea behind an MCMC is to determine the
probability distribution function of quantities of interest, such as model parameters,
by repeatedly querying datasets used for their measurement. The resulting sequence
of values forms a Markov chain that can be analyzed to find best-fit values and
confidence intervals. The modern-day data analyst will find that MCMCs are an
essential  tool  that  permits  tasks  that  are  simply  not  possible  with  other  methods, 
such  as  the  simultaneous  estimate  of  parameters  for  multi-parametric  models  of 
virtually  any  level  of  complexity,  even  in  the  presence  of  correlation  among  the 
parameters. 


16.1  Introduction  to  Monte  Carlo  Markov  chains 

A typical data analysis problem is the fit of data to a model with adjustable
parameters. Chapter 8 presented the maximum likelihood method to determine the
best-fit values and confidence intervals for the parameters. For the linear regression
to a two-variable dataset, in which the independent variable is assumed to be known
and the dependent variable has errors associated with its measurements, we found
an analytic solution for the best-fit parameters and their uncertainties (Sect. 8.3). Even
the case of a multiple linear regression is considerably more complex to solve
analytically (Chap. 9) and most fits to non-linear functions do not have analytic
solutions at all.

When an analytic solution is not available, the $\chi^2_{min}$ method to search for
best-fit parameters and their confidence intervals is still applicable, as described
in Sect. 10.3. The main complication is the computational cost of sampling the
parameter space in search of $\chi^2_{min}$ and surfaces of constant $\chi^2_{min} + \Delta\chi^2$. Consider,
for example, a model with 10 free parameters: even a very coarse sampling of 10
values for each parameter will result in $10^{10}$ evaluations of the likelihood, or $\chi^2$, to
cover the entire parameter space. Moreover, it is not always possible to improve the
situation by searching for just a few interesting parameters at a time, e.g., fixing the
value of the background while searching for the flux of the source. In fact, there may
be correlation among parameters and this requires that the parameters be estimated
simultaneously.

The Monte Carlo Markov chain (MCMC) methods presented in this chapter
provide a way to bypass altogether the need for a uniform sampling of parameter
space. This is achieved by constructing a Markov chain that only samples the
interesting region of parameter space, i.e., the region near the maximum of the
likelihood. The method is so versatile and computationally efficient that MCMC
techniques have become the leading analysis method in many fields of data analysis.


16.2  Requirements  and  Goals  of  a  Monte  Carlo 
Markov  Chain 

A Monte Carlo Markov chain makes use of a dataset $Z$ and a model with $m$
adjustable parameters, $\theta = (\theta_1, \ldots, \theta_m)$, for which it is possible to calculate the
likelihood

$$\mathcal{L} = P(Z/\theta). \tag{16.1}$$

Usually, the calculation of the likelihood is the most intensive task for an MCMC.
It is necessary to be able to evaluate the likelihood for all possible parameter values.

According to Bayesian statistics, one is allowed to have prior knowledge of the
parameters, even before they are measured (see Sect. 1.7). The prior knowledge may
come from experiments that were conducted beforehand, or from any other type of a
priori belief on the parameters. The prior probability distribution will be referred to
as $P(\theta)$.

The information we seek is the probability distribution of the model parameters
after the measurements are made, i.e., the posterior distribution $P(\theta/Z)$. According
to Bayes’ theorem, the posterior distribution is given by

$$P(\theta/Z) = \frac{P(\theta)\, P(Z/\theta)}{P(Z)} = \frac{P(\theta) \cdot \mathcal{L}}{P(Z)} \tag{16.2}$$

where the quantity $P(Z) = \int P(Z/\theta)\, P(\theta)\, d\theta$ is a normalization constant.

Taken at face value, (16.2) appears to be very complicated, as it requires a
multi-dimensional integration for the term $P(Z)$. The alternative provided by a Monte
Carlo Markov chain is the construction of a sequence of dependent samples for the
parameters $\theta$ in the form of a Markov chain. This Markov chain is constructed
in such a way that each parameter value appears in the chain in proportion to
the posterior distribution. With this method, it will be shown that the value of the
normalization constant $P(Z)$ becomes unimportant, thus significantly alleviating the
computational burden. The goal of a Monte Carlo Markov chain is therefore that
of creating a sequence of parameter values that has as its stationary distribution the
posterior distribution of the parameters. After the chain is run for a large number
of iterations, the posterior distribution is obtained via the sample distribution of the
parameters in the chain.

There are several algorithms to sample the parameter space that satisfy the
requirement of having the posterior distribution of the parameters $P(\theta/Z)$ as the
stationary distribution of the chain. A very common algorithm that can be used in
most applications is that of Metropolis and Hastings [19, 32]. It is surprisingly easy
to implement, and therefore constitutes a reference for any MCMC implementation.
Another algorithm is that of Gibbs, but its use is limited by certain specific
requirements on the distribution function of the parameters. Both algorithms presented in
this chapter provide a way to sample values of the parameters and describe a way to
accept them into the Markov chain.


16.3  The  Metropolis-Hastings  Algorithm 

The Metropolis-Hastings algorithm [19, 32] was devised well before personal
computers came into widespread use. In this section we first describe the algorithm and
then prove that the resulting Markov chain has the desired stationary distribution.

The  method  has  the  following  steps. 

1. The Metropolis-Hastings algorithm starts with an arbitrary choice of the initial
values of the model parameters, $\theta_0 = (\theta_0^1, \ldots, \theta_0^m)$. This initial set of parameters
is automatically accepted into the chain. As will be explained later, some of the
initial links in the MCMC will later be discarded to offset the arbitrary choice of
the starting point.

2. A candidate for the next link of the chain, $\theta'$, is then drawn from a proposal
(or auxiliary) distribution $q(\theta'/\theta_n)$, where $\theta_n$ is the current link in the chain.
The distribution $q(\theta'/\theta_n)$ is the probability of drawing a given candidate $\theta'$,
given that the chain is in state $\theta_n$. There is a large amount of freedom in the
choice of the auxiliary distribution, which can depend on the current state of
the chain $\theta_n$, according to the Markovian property, but not on its prior history.
One of the simplest choices for a proposal distribution is an $m$-dimensional
uniform distribution of fixed width in the neighborhood of the current parameter.
A uniform proposal distribution is very simple to implement, and it is the default
choice in many applications. More complex candidate distributions can be implemented using,
e.g., the method of simulation of variables described in Sect. 4.8.

3. A prior distribution $p(\theta)$ has to be assumed before a decision can be made
whether the candidate is accepted into the chain or rejected. The Metropolis-Hastings
algorithm gives freedom on the choice of the prior distribution as well.
A typical choice of prior is another uniform distribution between two hard limits,
enforcing a prior knowledge that a given parameter may not exceed certain
boundaries. Sometimes the boundaries are set by the nature of the parameter itself,
e.g., certain parameters may only be positive numbers, or lie in a fixed interval.
Other priors may be more restrictive. Consider the case of the measurement of
the slope of the curve in the Hubble experiment presented on page 157. It is clear
that, after a preliminary examination of the data, the slope parameter $b$ will not
be a negative number, and will not be larger than, say, $b = 2$. Therefore one can
safely assume a prior on this parameter equal to $p(b) = 1/2$, for $0 \le b \le 2$.
Much work on priors has been done by Jeffreys [23], in search of mathematical
functions that express the lack of prior knowledge, known as Jeffreys priors.
For many applications, though, simple uniform prior distributions are typically
sufficient.

4. After drawing a random candidate $\theta'$, we must decide whether to accept it into
the chain or reject it. This choice is made according to the following acceptance
probability, which is the heart of the Metropolis-Hastings algorithm:

$$\alpha(\theta'/\theta_n) = \min\left\{ \frac{\pi(\theta')\,q(\theta_n/\theta')}{\pi(\theta_n)\,q(\theta'/\theta_n)},\ 1 \right\}. \qquad (16.3)$$

The acceptance probability $\alpha(\theta'/\theta_n)$ determines the probability of going from
$\theta_n$ to the new candidate state $\theta'$, where $q(\theta'/\theta_n)$ is the proposal distribution,
and $\pi(\theta') = P(\theta'/Z)$ is the intended stationary distribution of the chain.
Equation (16.3) means that the probability of going to a new value in the chain
is proportional to the ratio of the posterior distribution of the candidate to that
of the previous link. The acceptance probability can also be rewritten by making
use of Bayes' theorem (16.2), as

$$\alpha(\theta'/\theta_n) = \min\left\{ \frac{p(\theta')\,P(Z/\theta')\,q(\theta_n/\theta')}{p(\theta_n)\,P(Z/\theta_n)\,q(\theta'/\theta_n)},\ 1 \right\}. \qquad (16.4)$$

In this form, the acceptance probability can be calculated based on known
quantities. The term $p(\theta_n)\,q(\theta'/\theta_n)$ at the denominator represents the probability
of occurrence of a given candidate $\theta'$; in fact, the first term is the prior probability
of the $n$-th link in the chain, and the second term is the probability of generating
the candidate, once the chain is at that state. The other term, $\mathcal{L} = P(Z/\theta_n)$,
is the likelihood of the current link in the chain. At the numerator, all terms
have reverse order of conditioning between the current link and the candidate.
Therefore all quantities in (16.4) are known, since $p(\theta_n)$ and $q(\theta'/\theta_n)$ (and their
conjugates) are chosen by the analyst and the likelihood can be calculated for all
model parameters.

Acceptance probability means that the candidate is accepted in the chain in
proportion to the value of $\alpha(\theta'/\theta_n)$. Two cases are possible:

• $\alpha = 1$: This means that the candidate will be accepted in the chain, since the
probability of acceptance is 100 %. The candidate becomes the next link in the
chain, $\theta_{n+1} = \theta'$. The min operator guarantees that the probability is never
greater than 1, which would not be meaningful.
• $\alpha < 1$: This means that the candidate can only be accepted in the chain with a
probability $\alpha$. To enforce this probability of acceptance, it is sufficient to draw
a random number $0 \le u \le 1$ and then accept or reject the candidate according
to the following criterion:

$$\begin{cases} \text{if } \alpha \ge u \Rightarrow \text{candidate is accepted, } \theta_{n+1} = \theta' \\ \text{if } \alpha < u \Rightarrow \text{candidate is rejected, } \theta_{n+1} = \theta_n. \end{cases} \qquad (16.5)$$

It is important to notice that if the candidate is rejected, then the chain does not
move from its current location and a new link equal to the previous one is added
to the chain. This means that at each time step in the chain a new link is added,
either by repeating the last link (if the candidate is rejected) or by adding a
different link (if the candidate is accepted).
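
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `metropolis_hastings`, the made-up data, the known uncertainty `SIGMA`, and the prior limits are all hypothetical. The proposal is taken to be a symmetric uniform box, so the two $q$ terms cancel in (16.4), and the ratio is evaluated in logarithmic form to avoid numerical underflow.

```python
import math
import random


def metropolis_hastings(log_likelihood, log_prior, theta0, half_widths, n_steps, rng):
    """Metropolis-Hastings with a symmetric uniform proposal (hypothetical helper).

    For a symmetric proposal q(theta_n/theta') = q(theta'/theta_n), so the q
    terms cancel in (16.4) and only prior times likelihood remains.
    """
    current = list(theta0)
    log_post = log_prior(current) + log_likelihood(current)
    if not math.isfinite(log_post):
        raise ValueError("the starting point must have non-zero prior and likelihood")
    chain = [current]          # step 1: the starting point is accepted automatically
    n_accepted = 0
    for _ in range(n_steps):
        # Step 2: candidate drawn uniformly in a box centred on the current link.
        candidate = [x + rng.uniform(-w, w) for x, w in zip(current, half_widths)]
        # Step 3: prior of the candidate (the log of a zero prior is -inf).
        log_post_cand = log_prior(candidate)
        if math.isfinite(log_post_cand):
            log_post_cand += log_likelihood(candidate)
        # Step 4: log of the acceptance probability, log(min(ratio, 1)).
        log_alpha = min(0.0, log_post_cand - log_post)
        # Criterion (16.5): accept if alpha >= u, with u uniform in (0, 1].
        u = 1.0 - rng.random()
        if math.log(u) <= log_alpha:
            current, log_post = candidate, log_post_cand
            n_accepted += 1
        chain.append(current)  # a rejected candidate repeats the current link
    return chain, n_accepted / n_steps


# Toy problem: mean of Gaussian measurements with known uncertainty.
DATA = [1.2, 0.8, 1.9, 1.4, 0.7, 1.6, 1.1, 1.3]   # made-up values, sample mean 1.25
SIGMA = 0.5                                        # assumed known uncertainty


def log_likelihood(theta):
    return -0.5 * sum((z - theta[0]) ** 2 for z in DATA) / SIGMA ** 2


def log_prior(theta):
    # Uniform prior between hard limits; its constant value cancels in the ratio.
    return 0.0 if -10.0 <= theta[0] <= 10.0 else -math.inf


chain, acceptance_rate = metropolis_hastings(
    log_likelihood, log_prior, theta0=[5.0], half_widths=[0.5],
    n_steps=20000, rng=random.Random(1))
burn_in = 1000                                     # initial links to discard
kept = [link[0] for link in chain[burn_in:]]
posterior_mean = sum(kept) / len(kept)
print(f"acceptance rate {acceptance_rate:.2f}, posterior mean {posterior_mean:.3f}")
```

Neither the prior nor the likelihood is normalized in this sketch, which is possible because only ratios enter the acceptance probability.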

The  logic  of  the  Metropolis-Hastings  algorithm  can  be  easily  understood  in  the 
case  of  uniform  priors  and  auxiliary  distributions.  In  that  case,  the  candidate  is 
accepted  in  proportion  to  just  the  ratio  of  the  likelihoods,  since  all  other  terms  in 
(16.3)  cancel  out: 


$$\alpha(\theta'/\theta_n) = \min\left\{ \frac{\mathcal{L}(\theta')}{\mathcal{L}(\theta_n)},\ 1 \right\}. \qquad (16.6)$$

If  the  candidate  has  a  higher  likelihood  than  the  current  link,  it  is  automatically 
accepted.  If  the  likelihood  of  the  candidate  is  lower  than  the  likelihood  of  the  current 
link,  then  it  is  accepted  in  proportion  to  the  ratio  of  the  likelihoods  of  the  candidate 
and  of  the  current  link.  The  possibility  of  accepting  a  parameter  of  lower  likelihood 
permits  a  sampling  of  the  parameter  space,  instead  of  a  simple  search  for  the  point 
of  maximum  likelihood  which  would  only  result  in  a  point  estimate. 

We now show that use of the Metropolis-Hastings algorithm creates a Markov
chain that has $\pi(\theta_n) = P(\theta_n/Z)$ as its stationary distribution. For this purpose, we
will show that the posterior distribution of the parameters satisfies the relationship

$$\pi(\theta_n) = \sum_j \pi(\theta_j)\,p_{jn}, \qquad (16.7)$$

where $p_{jn}$ are the transition probabilities of the Markov chain and the index $j$ runs
over all possible states.

Proof  (Justification  of  the  Metropolis-Hastings  Algorithm)  To  prove  that  the 
Metropolis-Hastings  algorithm  leads  to  a  Markov  chain  with  the  desired 
stationary  distribution,  consider  the  time-reversed  chain: 

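original chain: $X_0 \rightarrow X_1 \rightarrow X_2 \rightarrow \ldots \rightarrow X_n \rightarrow \ldots$

time-reversed chain: $X_0 \leftarrow X_1 \leftarrow \ldots \leftarrow X_n \leftarrow X_{n+1} \leftarrow \ldots$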




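The time-reversed chain is defined by the transition probability $p^*_{ij}$:

$$p^*_{ij} = P(X_n = \varepsilon_j / X_{n+1} = \varepsilon_i) = \frac{P(X_n = \varepsilon_j,\ X_{n+1} = \varepsilon_i)}{P(X_{n+1} = \varepsilon_i)} = \frac{P(X_{n+1} = \varepsilon_i / X_n = \varepsilon_j)\,P(X_n = \varepsilon_j)}{P(X_{n+1} = \varepsilon_i)},$$

leading to the following relationship with the transition probability of the
original chain:

$$p^*_{ij} = \frac{p_{ji}\,\pi(\theta_j)}{\pi(\theta_i)}. \qquad (16.8)$$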


If the original chain is time-reversible, then $p^*_{ij} = p_{ij}$, and the time-reversed
process is also a Markov chain. In this case, the stationary distribution will
follow the relationship

$$\pi(\theta_i) \cdot p_{ij} = p_{ji} \cdot \pi(\theta_j), \qquad (16.9)$$

known as the equation of detailed balance. The detailed balance is the
hallmark of a time-reversible Markov chain, stating that the probability to
move forward and backwards is the same, once the stationary distribution is
reached. Therefore, if the transition probability of the Metropolis-Hastings
algorithm satisfies this equation, with $\pi(\theta) = P(\theta/Z)$, then the chain
is time-reversible, and has the desired stationary distribution. Moreover,
Theorem 15.4 can be used to prove that this distribution is unique.

The Metropolis-Hastings algorithm enforces a specific transition probability
between states $\theta_i$ and $\theta_j$,

$$p_{ij} = q(\theta_j/\theta_i)\,\alpha(\theta_j/\theta_i) \quad \text{if } \theta_i \neq \theta_j, \qquad (16.10)$$

where $q$ is the probability of generating the candidate (or proposal
distribution), and $\alpha$ the probability of accepting it. One can also show that the
probability of remaining at the same state $\theta_i$ is

$$p_{ii} = 1 - \sum_{j \neq i} q(\theta_j/\theta_i)\,\alpha(\theta_j/\theta_i),$$

where the sum is over all possible states.
According to the transition probability described by (16.3),

$$\alpha(\theta_j/\theta_i) = \min\left\{ \frac{p(\theta_j)\,P(Z/\theta_j)\,q(\theta_i/\theta_j)}{p(\theta_i)\,P(Z/\theta_i)\,q(\theta_j/\theta_i)},\ 1 \right\} = \min\left\{ \frac{\pi(\theta_j)\,q(\theta_i/\theta_j)}{\pi(\theta_i)\,q(\theta_j/\theta_i)},\ 1 \right\},$$

in which we have substituted $\pi(\theta_i) = p(\theta_i/Z) = P(Z/\theta_i)\,p(\theta_i)/p(Z)$ as the
posterior distribution. Notice that the probability $p(Z)$ cancels out, therefore
its value does not play a role in the construction of the chain.

It is clear that, if $\alpha(\theta_j/\theta_i) < 1$, then $\alpha(\theta_i/\theta_j) = 1$, thanks to the min
operation. Assume, without loss of generality, that $\alpha(\theta_j/\theta_i) < 1$:

$$\alpha(\theta_j/\theta_i) = \frac{\pi(\theta_j)\,q(\theta_i/\theta_j)}{\pi(\theta_i)\,q(\theta_j/\theta_i)} \ \Rightarrow\ \alpha(\theta_j/\theta_i)\,\pi(\theta_i)\,q(\theta_j/\theta_i) = \pi(\theta_j)\,q(\theta_i/\theta_j) \cdot \alpha(\theta_i/\theta_j).$$

Now, since we assumed $\alpha(\theta_j/\theta_i) < 1$, the operation of min becomes
redundant. Using (16.10) the previous equation simplifies to

$$p_{ij} \cdot \pi(\theta_i) = p_{ji} \cdot \pi(\theta_j),$$

which shows that the Metropolis-Hastings algorithm satisfies the detailed
balance equation; it thus generates a time-reversible Markov chain, with
stationary distribution equal to the posterior distribution. □
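
The result of the proof can be checked numerically on a small discrete state space. The sketch below is only an illustration: the four unnormalized weights and the non-symmetric proposal matrix `q` are arbitrary made-up values. It builds the transition probabilities of (16.10) and verifies the detailed balance equation (16.9) and the stationarity condition (16.7).

```python
# Unnormalized target weights, playing the role of prior times likelihood.
weights = [1.0, 3.0, 2.0, 4.0]
pi = [w / sum(weights) for w in weights]          # normalized target distribution

# Arbitrary proposal matrix, q[i][j] = q(theta_j/theta_i); each row sums to 1.
q = [[0.10, 0.40, 0.30, 0.20],
     [0.25, 0.25, 0.25, 0.25],
     [0.50, 0.10, 0.20, 0.20],
     [0.30, 0.30, 0.30, 0.10]]
n_states = len(weights)


def alpha(i, j):
    """Acceptance probability (16.3) for a move from state i to state j."""
    # Only the ratio of weights enters, so the normalization is never needed.
    return min(weights[j] * q[j][i] / (weights[i] * q[i][j]), 1.0)


# Transition probabilities (16.10) for i != j, then the probability of staying.
p = [[q[i][j] * alpha(i, j) if i != j else 0.0 for j in range(n_states)]
     for i in range(n_states)]
for i in range(n_states):
    p[i][i] = 1.0 - sum(p[i])

# Largest violation of detailed balance (16.9) over all pairs of states.
detailed_balance_gap = max(abs(pi[i] * p[i][j] - pi[j] * p[j][i])
                           for i in range(n_states) for j in range(n_states))
# Largest violation of the stationarity condition (16.7).
stationarity_gap = max(abs(pi[n] - sum(pi[j] * p[j][n] for j in range(n_states)))
                       for n in range(n_states))
print(detailed_balance_gap, stationarity_gap)
```

Both gaps should vanish to within floating-point rounding, even though the proposal matrix is not symmetric.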

Example  16.1  The  data  from  Hubble’s  experiment  (page  157)  can  be  used  to  run  a 
Monte  Carlo  Markov  chain  to  obtain  the  posterior  distribution  of  the  parameters 
a  and  b.  This  fit  was  also  performed  using  a  maximum  likelihood  method  (see 
page  159)  in  which  the  common  uncertainty  in  the  dependent  variable,  log  v ,  was 
estimated  according  to  the  method  described  in  Sect.  8.5. 

Using these data, a chain is constructed using uniform priors on the two fit
parameters $a$ and $b$:

$$\begin{cases} p(a) = \dfrac{10}{7} & \text{for } 0.2 \le a \le 0.9 \\ p(b) = 10 & \text{for } 0.15 \le b \le 0.25. \end{cases}$$

The proposal distributions are also uniform distributions, respectively, of fixed
half-width 0.1 and 0.02 for $a$ and $b$, and centered at the current value of the parameters:

$$\begin{cases} p(\theta_{n+1}/a_n) = 5 & \text{for } a_n - 0.1 \le \theta_{n+1} \le a_n + 0.1 \\ p(\theta_{n+1}/b_n) = 25 & \text{for } b_n - 0.02 \le \theta_{n+1} \le b_n + 0.02 \end{cases}$$

in which $a_n$ and $b_n$ are, respectively, the $n$-th links of the chain, and $\theta_{n+1}$ represents
the candidate for the $(n+1)$-th link of the chain, for each parameter.

In practice, once the choice of a uniform distribution with fixed width is made, the
actual values of the prior and proposal distributions are not used explicitly. In fact,
the acceptance probability becomes simply a function of the ratio of the likelihoods,
or of the $\chi^2$'s:

$$\alpha(\theta'/\theta_n) = \min\left\{ \frac{\mathcal{L}(\theta')}{\mathcal{L}(\theta_n)},\ 1 \right\} = \min\left\{ e^{\frac{\chi^2(\theta_n) - \chi^2(\theta')}{2}},\ 1 \right\},$$

where $\chi^2(\theta_n)$ and $\chi^2(\theta')$ are the $\chi^2$ values calculated, respectively, using the
$n$-th link of the chain and the candidate parameters (Fig. 16.1).

A  few  steps  of  the  chain  are  reported  in  Table  16.1.  Where  two  consecutive  links 
in  the  chain  are  identical,  it  is  an  indication  that  the  candidate  parameter  drawn  at 
that  iteration  was  rejected,  and  the  previous  link  was  therefore  repeated.  Figure  16.2 
shows  the  sample  distributions  of  the  two  fit  parameters  from  a  chain  with  100,000 
links.  A  wider  prior  on  parameter  a  would  make  it  possible  to  explore  further  the 
tails  of  the  distribution. 
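
A chain of this kind can be sketched as follows. The data, the common uncertainty `SIGMA`, the prior limits, and the starting point are made-up values chosen for illustration; they are not the Hubble data of Table 8.1, and the sketch is not the code used to produce Table 16.1. Each link stores $(a, b, \chi^2)$, and a rejected candidate shows up as a repeated link.

```python
import math
import random

# Made-up data for illustration (NOT the Hubble data of Table 8.1).
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [0.55, 0.62, 1.01, 1.08, 1.37, 1.38, 1.73, 1.99, 2.04, 2.26]
SIGMA = 0.1                                   # assumed common uncertainty in y
PRIOR = {"a": (0.0, 1.0), "b": (0.0, 0.4)}    # hypothetical hard limits of the priors
HALF_WIDTH = {"a": 0.1, "b": 0.02}            # half-widths of the uniform proposals


def chi2(a, b):
    """Chi-squared of the linear model y = a + b x."""
    return sum(((y - (a + b * x)) / SIGMA) ** 2 for x, y in zip(X, Y))


def run_chain(a, b, n_steps, rng):
    chain = [(a, b, chi2(a, b))]
    for _ in range(n_steps):
        a_n, b_n, chi2_n = chain[-1]
        a_c = a_n + rng.uniform(-HALF_WIDTH["a"], HALF_WIDTH["a"])
        b_c = b_n + rng.uniform(-HALF_WIDTH["b"], HALF_WIDTH["b"])
        inside = (PRIOR["a"][0] <= a_c <= PRIOR["a"][1]
                  and PRIOR["b"][0] <= b_c <= PRIOR["b"][1])
        if inside:
            chi2_c = chi2(a_c, b_c)
            # alpha = min(exp((chi2_n - chi2_c) / 2), 1); the exponent is capped at 0.
            alpha = math.exp(min(0.0, 0.5 * (chi2_n - chi2_c)))
        else:
            chi2_c, alpha = math.inf, 0.0     # zero prior: the candidate is rejected
        if inside and rng.random() < alpha:
            chain.append((a_c, b_c, chi2_c))
        else:
            chain.append(chain[-1])           # repeat the previous link
    return chain


chain = run_chain(a=0.9, b=0.35, n_steps=20000, rng=random.Random(2))
kept = chain[2000:]                           # discard the initial links
mean_a = sum(link[0] for link in kept) / len(kept)
mean_b = sum(link[1] for link in kept) / len(kept)
print(f"a = {mean_a:.3f}, b = {mean_b:.4f}")
```

For these made-up data the least-squares values are approximately $a = 0.525$ and $b = 0.195$, and the sample means of the chain are expected to fall close to them.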


[Figure 16.1; horizontal axis: iteration number]

Fig. 16.1 MCMC for parameters a, b of linear model fit to the data in Table 8.1. The chain was
run for 10,000 iterations, using uniform priors on both parameters (between 0.2 and 0.9 for a,
and 0.15 and 0.25 for b). The chain started at a = 0.90 and b = 0.25. The proposal distributions
were also uniform, with width of, respectively, 0.2 for a and 0.04 for b, centered at the current
value of the chain
Table 16.1 Sample of MCMC chain for the Hubble data

     n         a          b       $\chi^2(\theta_n)$
     1      0.90000    0.25000    3909.55420
     2      0.94116    0.24395    3563.63110
     3      0.96799    0.23951    3299.28149
     4      0.96799    0.23951    3299.28149
     5      0.96799    0.23951    3299.28149
     6      0.96799    0.23951    3299.28149
     7      0.97868    0.22983    2503.21655
     8      0.97868    0.22983    2503.21655
     9      0.96878    0.22243    1885.28088
    10      1.01867    0.21679    1714.54456
   ...
    21      1.08576    0.19086     563.56506
    22      1.06243    0.19165     536.47919
    23      1.06243    0.19165     536.47919
    24      1.06559    0.18244     254.36528
    25      1.06559    0.18244     254.36528
    26      1.06559    0.18244     254.36528
    27      1.06559    0.18244     254.36528
    28      1.06559    0.18244     254.36528
    29      1.04862    0.17702     118.84048
    30      1.04862    0.17702     118.84048
   ...
   131      0.84436    0.17885      13.11242
   132      0.84436    0.17885      13.11242
   133      0.84436    0.17885      13.11242
   134      0.80627    0.18064      11.47313
   135      0.80627    0.18064      11.47313
   136      0.80627    0.18064      11.47313
   137      0.77326    0.18284      10.63887
   138      0.77326    0.18284      10.63887
   139      0.77326    0.18284      10.63887
   140      0.77326    0.18284      10.63887
   ...
  1141      0.42730    0.20502       8.90305
  1142      0.42730    0.20502       8.90305
  1143      0.42174    0.20494       8.68957
  1144      0.42174    0.20494       8.68957
  1145      0.42174    0.20494       8.68957
  1146      0.42174    0.20494       8.68957
  1147      0.42174    0.20494       8.68957
  1148      0.42174    0.20494       8.68957
  1149      0.42174    0.20494       8.68957
  1150      0.43579    0.20323       8.65683
   ...
  9991      0.66217    0.19189      12.43171
  9992      0.62210    0.19118       8.52254
  9993      0.62210    0.19118       8.52254
  9994      0.62210    0.19118       8.52254
  9995      0.62210    0.19118       8.52254
  9996      0.62210    0.19118       8.52254
  9997      0.62210    0.19118       8.52254
  9998      0.62210    0.19118       8.52254
  9999      0.64059    0.18879      11.11325
10,000      0.64059    0.18879      11.11325

Fig. 16.2 Sample distribution function for parameters a and b, constructed using a histogram plot
of 100,000 samples of an MCMC run with the same parameters as Fig. 16.1


16.4 The Gibbs Sampler

The  Gibbs  sampler  is  another  algorithm  that  creates  a  Markov  chain  having  as 
stationary  distribution  the  posterior  distribution  of  the  parameters.  This  algorithm 
is  based  on  the  availability  of  the  full  conditional  distribution ,  defined  as 


$$\pi_i(\theta_i) = \pi(\theta_i | \theta_j,\ j \neq i). \qquad (16.11)$$

The  full  conditional  distribution  is  the  (posterior)  distribution  of  a  given  parameter, 
given  that  the  values  of  all  other  parameters  are  known.  If  the  full  conditional 
distributions  are  known  and  can  be  sampled  from,  then  a  simple  algorithm  can  be 
implemented: 

1. Start the chain at a given value of the parameters, $\theta_0 = (\theta_0^1, \ldots, \theta_0^m)$.

2. Obtain a new value in the chain through successive generations:

$\theta_1^1$ drawn from $\pi(\theta_1 | \theta_0^2, \theta_0^3, \ldots)$
$\theta_1^2$ drawn from $\pi(\theta_2 | \theta_1^1, \theta_0^3, \ldots)$
$\ldots$
$\theta_1^m$ drawn from $\pi(\theta_m | \theta_1^1, \theta_1^2, \ldots, \theta_1^{m-1})$

3. Iterate until convergence to the stationary distribution is reached.

The  justification  of  this  method  can  be  found  in  [15].  In  the  case  of  data  fitting 
with  a  dataset  Z  and  a  model  with  m  adjustable  parameters,  usually  it  is  not  possible 
to  know  the  full  conditional  distributions,  thus  this  method  is  not  as  common  as 
the  Metropolis-Hastings  algorithm.  The  great  advantage  of  the  Gibbs  sampler  is  the 
fact  that  the  acceptance  is  100%,  since  there  is  no  rejection  of  candidates  for  the 
Markov  chain,  unlike  the  case  of  the  Metropolis-Hastings  algorithm. 
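
The mechanics of the Gibbs sampler can be illustrated with a toy target for which the full conditional distributions are known exactly. The sketch below assumes a bivariate Gaussian with zero means, unit variances, and correlation `RHO` (a made-up example, not one from this chapter); for this target each full conditional is a Gaussian with mean equal to `RHO` times the other variable and variance `1 - RHO**2`.

```python
import math
import random

RHO = 0.8   # correlation of the toy target (an assumed value)


def gibbs_bivariate_gaussian(theta0, n_steps, rng):
    """Gibbs sampler for a standard bivariate Gaussian with correlation RHO."""
    cond_sigma = math.sqrt(1.0 - RHO ** 2)    # standard deviation of each conditional
    x, y = theta0
    chain = [(x, y)]
    for _ in range(n_steps):
        x = rng.gauss(RHO * y, cond_sigma)    # draw x from pi(x | y)
        y = rng.gauss(RHO * x, cond_sigma)    # draw y from pi(y | x), using the new x
        chain.append((x, y))                  # every draw is kept: no rejection step
    return chain


chain = gibbs_bivariate_gaussian(theta0=(5.0, -5.0), n_steps=20000, rng=random.Random(4))
kept = chain[500:]                            # discard the initial links
mean_x = sum(p[0] for p in kept) / len(kept)
mean_y = sum(p[1] for p in kept) / len(kept)
cov = sum((p[0] - mean_x) * (p[1] - mean_y) for p in kept)
var_x = sum((p[0] - mean_x) ** 2 for p in kept)
var_y = sum((p[1] - mean_y) ** 2 for p in kept)
sample_corr = cov / math.sqrt(var_x * var_y)
print(f"sample correlation {sample_corr:.3f}")
```

Unlike the Metropolis-Hastings chains, no two consecutive links are identical, since each draw from a full conditional is accepted.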

Example  16.2  This  example  reproduces  an  application  presented  by  Carlin  et  al. 
[8],  and  illustrates  a  possible  application  in  which  the  knowledge  of  the  full 
conditional  distribution  results  in  the  possibility  of  implementing  a  Gibbs  sampler. 

Consider the case in which a Poisson dataset of $n$ numbers, $y_i$, $i = 1, \ldots, n$, is fit
to a step-function model:

$$\begin{cases} y = \lambda & \text{if } i \le m \\ y = \mu & \text{if } i > m. \end{cases} \qquad (16.12)$$

The model therefore has three parameters, the values $\lambda$, $\mu$, and the point of
discontinuity, $m$. This situation could be a set of measurements of a quantity that
may suddenly change its value at an unknown time, say the voltage in a given portion
of an electric circuit after a switch has been opened or closed.
Assume that the priors on the parameters are, respectively, gamma distributions
for $\lambda$ and $\mu$, $p(\lambda) = G(\alpha, \beta)$ and $p(\mu) = G(\gamma, \delta)$, and a uniform distribution for
$m$, $p(m) = 1/n$ (see Sect. 7.2 for the definition of the gamma distribution). According
to Bayes' theorem, the posterior distribution is proportional to the product of the
likelihood and the priors:

$$\pi(\lambda, \mu, m) \propto P(y_1, \ldots, y_n/\lambda, \mu, m) \cdot p(\lambda)\,p(\mu)\,p(m). \qquad (16.13)$$

The posterior is therefore given by

The equation above indicates that the conditional posteriors, obtained by fixing all
parameters except one, are given by

$$\pi_\lambda(\lambda) = G\left(\alpha + \sum_{i=1}^{m} y_i,\ \beta + m\right), \quad \ldots$$

This is therefore an example of a case where the conditional posterior distributions
are known, and therefore the Gibbs algorithm is applicable. The only complication
is the simulation of the three conditional distributions, which can be achieved using
the methods described in Sect. 4.8.
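
A Gibbs sampler for a step-function model of this kind can be sketched as follows. The counts are made-up, and the gamma priors are assumed to be in a shape and rate parameterization, which may differ from the convention of Sect. 7.2. Under that assumption, multiplying the Poisson likelihood by the priors gives a gamma conditional for $\lambda$ with shape $\alpha + \sum_{i \le m} y_i$ and rate $\beta + m$, a gamma conditional for $\mu$ with shape $\gamma + \sum_{i > m} y_i$ and rate $\delta + n - m$, and a discrete conditional for $m$ proportional to the likelihood. This is an illustration and not a reproduction of the analysis of Carlin et al.

```python
import math
import random
import statistics

# Made-up counts with a change after the 20th measurement (illustration only).
Y = [3, 2, 4, 3, 1, 5, 2, 3, 4, 2, 3, 3, 1, 4, 2, 5, 3, 2, 4, 3,
     9, 8, 11, 7, 10, 9, 12, 8, 9, 10, 7, 11, 9, 8, 10, 9, 13, 8, 9, 10]
N = len(Y)
ALPHA, BETA = 1.0, 0.1     # assumed gamma prior on lambda: shape, rate
GAMMA, DELTA = 1.0, 0.1    # assumed gamma prior on mu: shape, rate
CUMSUM = [0]
for y in Y:
    CUMSUM.append(CUMSUM[-1] + y)             # CUMSUM[m] = y_1 + ... + y_m


def gibbs_change_point(m, n_steps, rng):
    chain = []
    for _ in range(n_steps):
        s_lo, s_hi = CUMSUM[m], CUMSUM[N] - CUMSUM[m]
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lam = rng.gammavariate(ALPHA + s_lo, 1.0 / (BETA + m))
        mu = rng.gammavariate(GAMMA + s_hi, 1.0 / (DELTA + N - m))
        # Discrete conditional of m on 1..N; the uniform prior 1/n cancels.
        log_w = [CUMSUM[k] * math.log(lam) - k * lam
                 + (CUMSUM[N] - CUMSUM[k]) * math.log(mu) - (N - k) * mu
                 for k in range(1, N + 1)]
        top = max(log_w)                      # subtract the maximum to avoid overflow
        w = [math.exp(v - top) for v in log_w]
        m = rng.choices(range(1, N + 1), weights=w)[0]
        chain.append((lam, mu, m))
    return chain


chain = gibbs_change_point(m=5, n_steps=5000, rng=random.Random(6))
kept = chain[500:]                            # discard the initial links
mean_lam = statistics.fmean(link[0] for link in kept)
mean_mu = statistics.fmean(link[1] for link in kept)
mode_m = statistics.mode(link[2] for link in kept)
print(f"lambda = {mean_lam:.2f}, mu = {mean_mu:.2f}, most frequent m = {mode_m}")
```

With these made-up counts the chain is expected to settle on a change point at $m = 20$, with $\lambda$ near 3 and $\mu$ near 9.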


16.5  Tests  of  Convergence 

It  is  necessary  to  test  that  the  MCMC  has  reached  convergence  to  the  stationary 
distribution  before  inference  on  the  posterior  distribution  can  be  made.  Convergence 
indicates  that  the  chain  has  started  to  sample  the  posterior  distribution,  so  that  the 
MCMC  samples  are  representative  of  the  distribution  of  interest,  and  are  not  biased 
by  such  choices  as  the  starting  point  of  the  chain.  The  period  of  time  required  for 
the  chain  to  reach  convergence  goes  under  the  name  of  burn-in  period,  and  varies 
from  chain  to  chain  according  to  a  variety  of  factors,  such  as  the  choice  of  prior  and 
proposal distributions. We therefore must identify and remove such initial period
from the chain prior to further analysis. The Geweke Z score test and the
Gelman-Rubin test are two of the most common tests used to identify the burn-in period.

Another important consideration is that the chain must be run for a sufficient
number of iterations, so that the sample distribution becomes a good approximation
of the true posterior distribution. It is clear that the larger the number of iterations
after the burn-in period, the more accurate will be the estimates of the parameters of
the posterior distribution. In practice it is convenient to know the minimum stopping
time that makes it possible to estimate the posterior distribution with the required precision.
The Raftery-Lewis test is designed to give an approximate estimate of both the
burn-in time and the minimum required stopping time.

Typical considerations concerning the burn-in period and the stopping time of
a chain can be illustrated with the example of three chains based on the data from
Table 10.1. The chains were run, respectively, with a uniform proposal distribution
of width 1, 10, and 100 for both parameters of the linear model, starting at the same point
(Figs. 16.3, 16.4 and 16.5). The chain with a narrower proposal distribution requires
a longer time to reach the stationary value of the parameters, in part because at
each time interval the candidate can be chosen in just a limited neighborhood of the
previous link. Moreover, the sampling of parameter space is less uniform because
the chain requires longer time to span the entire parameter range. The intermediate
value for the proposal distribution results in an almost immediate convergence, and
the sampling of parameter space is clearly more uniform. An increase in the size
of the proposal distribution, however, may eventually lead to slow convergence and
poor sampling, as indicated by the chain with the largest value of the proposal width.
In this case, candidates are drawn from regions of parameter space that have very
low likelihood, or large $\chi^2$, and therefore the chain has a tendency to remain at the
same location for extended periods of time, with low acceptance rate. The result
is a chain with poor coverage of parameter space and poorly determined sample
distribution for the parameters. A smoother distribution is preferable, because it
leads to a more accurate determination of the median, and of confidence ranges on
the parameters.

[Figure 16.3; panels: Parameter a, Parameter b; horizontal axis: iteration]

Fig. 16.3 MCMC for parameters a, b of linear model fit to the data in Table 10.1, using a uniform
proposal distribution with width of 1 for both parameters. The chain started at a = 12 and b = 6.
In grey is the sample distribution obtained by removing the initial 2000 iterations, the ones that are
most affected by the arbitrary choice of starting point

Fig. 16.4 MCMC for parameters a, b of linear model fit to the data in Table 10.1, using a uniform
proposal distribution with width of 10 for both parameters. The chain started at same values as in
Fig. 16.3

Fig. 16.5 MCMC for parameters a, b of linear model fit to the data in Table 10.1, using a uniform
proposal distribution with width of 50 for both parameters. The chain started at same values as in
Fig. 16.3

Another consideration is that elements in the chain are more or less correlated
to one another, according to the choice of the proposal distribution, and other
choices in the construction of the chain. Links in the chains are always correlated
by construction, since the next link in the chain typically depends on the current
state of the chain. In principle a Markov chain can be constructed that does not
depend on the current state of the chain, but in most cases it is convenient to make
full use of the Markovian property, which allows the use of the current state of
the chain. The chains in Figs. 16.3, 16.4 and 16.5 illustrate the fact that the degree
of correlation varies with the proposal distribution choice. For example, the chain
with the narrowest proposal distribution appears more correlated than that with the
intermediate choice for the width; also, the chain with the largest width appears to
have periods with the highest degree of correlation, when the chain does not move
for hundreds of iterations. This shows that the degree of correlation is a nonlinear
function of the proposal distribution width, and that fine-tuning is always required
to obtain a chain with good mixing properties. The degree of correlation among
elements of the chain will become important when we desire to estimate the variance
of the mean from a specific segment of the chain, since the formulas derived earlier
in Chap. 4 apply only to independent samples.
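
The dependence of the acceptance rate and of the correlation on the proposal width can be explored with a small numerical experiment. The sketch below uses a toy one-dimensional target (a standard Gaussian likelihood) and three made-up proposal half-widths; it does not use the data of Table 10.1. The lag-1 autocorrelation of the chain is used here as a simple measure of the correlation between consecutive links.

```python
import math
import random


def sample_chain(half_width, n_steps, rng):
    """Metropolis-Hastings on a standard Gaussian target with a uniform proposal."""
    x, chain, n_accepted = 0.0, [], 0
    for _ in range(n_steps):
        candidate = x + rng.uniform(-half_width, half_width)
        # Likelihood ratio of the candidate to the current link, capped at 1.
        alpha = math.exp(min(0.0, 0.5 * (x * x - candidate * candidate)))
        if rng.random() < alpha:
            x, n_accepted = candidate, n_accepted + 1
        chain.append(x)
    return chain, n_accepted / n_steps


def lag1_autocorrelation(chain):
    """Sample correlation between consecutive links of the chain."""
    mean = sum(chain) / len(chain)
    var = sum((v - mean) ** 2 for v in chain)
    cov = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(len(chain) - 1))
    return cov / var


rng = random.Random(5)
results = {}
for half_width in (0.1, 2.5, 100.0):          # narrow, intermediate, very wide
    chain, rate = sample_chain(half_width, 20000, rng)
    results[half_width] = (rate, lag1_autocorrelation(chain))
    print(f"half-width {half_width:6.1f}: acceptance {rate:.3f}, "
          f"lag-1 autocorrelation {results[half_width][1]:.3f}")
```

In this toy case the narrow proposal accepts nearly every candidate but moves in very small steps, the very wide proposal rarely accepts, and the intermediate choice gives the least correlated chain.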

Testing  for  convergence  and  stopping  time  of  the  chain  are  critical  tasks  for 
a  Monte  Carlo  Markov  chain.  The  tests  discussed  below  are  some  of  the  more 
common  analytic  tools  and  can  be  implemented  with  relative  ease. 




16.5.1  The  Geweke  Z  Score  Test 


A simple test of convergence is provided by the difference of the mean of two
segments of the chain. Under the null hypothesis that the chain is sampling the
same distribution during both segments, the sample means are expected to be drawn
from the same parent mean. Consider segment A at the beginning of the chain, and
segment B at the end of the chain; for simplicity, consider one parameter $\psi$ at a time.
If the chain is of length $N$, the prescription is to use an initial segment of $N_A = 0.1N$
elements, and a final segment with $N_B = 0.5N$ links, although those choices are
somewhat arbitrary, and segments of different length can also be used.

The mean of each parameter in the two segments A and B is calculated as

$$\begin{cases} \bar{\psi}_A = \dfrac{1}{N_A} \displaystyle\sum_{j=1}^{N_A} \psi_j \\ \bar{\psi}_B = \dfrac{1}{N_B} \displaystyle\sum_{j=N-N_B+1}^{N} \psi_j. \end{cases} \qquad (16.15)$$

To compare the two sample means, it is also necessary to estimate their sample
variances $\sigma^2_{\bar{\psi}_A}$ and $\sigma^2_{\bar{\psi}_B}$. This task is complicated significantly by the fact that one
cannot just use (2.11), because of the correlation between links of the chain. One
possibility to overcome this difficulty is to thin the chain by using only every $n$-th
iteration, so that the thinned chain better approximates independent samples.

The test statistic is the Z score of the difference between the means of the two
segments:

$$Z_G = \frac{\bar{\psi}_B - \bar{\psi}_A}{\sqrt{\sigma^2_{\bar{\psi}_A} + \sigma^2_{\bar{\psi}_B}}}. \qquad (16.16)$$

Under the assumption that the two means follow the same distribution and that they
are uncorrelated, the Z score is distributed as a standard Gaussian, $Z_G \sim N(0, 1)$.
For this reason the two segments of the chain are typically separated by a large
number of iterations. An application of the Geweke Z score is to step the start of
segment A forward in time, until the $Z_G$ scores do not exceed approximately ±3,
which corresponds to a $\pm 3\sigma$ deviation in the means of the two segments. The
burn-in period that needs to be excised is that before the Z scores stabilize around the
expected values. An example of the use of this test statistic is provided in Fig. 16.6,
in which $Z_G$ was calculated from the chain with proposal width 10. An initial
segment of length 20 % of the total chain length is compared to the final 40 % of
the chain, by stepping the beginning of the initial segment until it overlaps with
the final segment. By using all links in the chain to estimate the variance of the
mean, the variance would be underestimated because of the correlation among links,
leading to erroneously large values of $Z_G$. If the chain is thinned by a factor of 10,
then the estimate of the variance using (2.11) is more accurate, and the resulting Z
scores show that the chain converges nearly immediately, as is also clear by a visual
inspection of the chain from Fig. 16.4.

Fig. 16.6 Geweke Z scores with segment A and segment B, respectively, 20 and 40 % of the total
chain length. The results correspond to the chain run with a proposal width of 10. The Z scores are
calculated by using only every 10th iteration

The  effect  of  the  starting  point  in  the  evaluation  of  the  burn-in  period  is  shown  in 
Fig.  16.3,  in  which  it  is  apparent  that  it  takes  about  2000  iterations  for  the  chain  to 
forget  the  initial  position,  and  to  start  sampling  the  posterior  distribution,  centered 
at  the  dashed  lines.  A  larger  proposal  distribution,  as  in  Fig.  16.4,  makes  it  easier 
to  reach  the  posterior  distribution  more  rapidly,  to  the  point  that  no  burn-in  period 
is  visible  in  this  case.  In  the  presence  of  a  burn-in  period,  the  sample  distribution 
must  be  constructed  by  excising  the  initial  portion  of  the  chain,  as  shown  in  the  grey 
histogram  plot  of  Fig.  16.3. 


16.5.2  The  Gelman-Rubin  Test 

The Gelman-Rubin test investigates the effect of initial conditions on the convergence
properties of the MCMC and makes use of m parallel chains starting from
different initial points. Initially, the m chains will be far apart because of the different
starting points. As the chains start sampling the stationary distribution, they will
have the same statistical properties.

The test is based on two estimates of the variance, or variability, of the chains: the
within-chain variance W for each of the m chains, and the between-chain variance B.
At the beginning of the chain, W will underestimate the true variance of the model
parameters, because the chains have not had time to sample all possible values. On
the other hand, B will initially overestimate the variance, because of the different
starting points. The test devised by Gelman and Rubin [17] uses a ratio built from
the within-chain and between-chain variances to measure convergence of the chains, and to
identify an initial burn-in period that should be removed because of the lingering
effect of initial conditions.

For each parameter $\psi$, consider m chains of N iterations each, where $\bar{\psi}_i$ is the mean
of each chain $i = 1, \ldots, m$ and $\bar{\psi}$ the mean of the means:

$$\bar{\psi}_i = \frac{1}{N} \sum_{j=1}^{N} \psi_{ij}, \qquad \bar{\psi} = \frac{1}{m} \sum_{i=1}^{m} \bar{\psi}_i. \qquad (16.17)$$

The between-chain variance B is defined from the variance of the means of the m
chains,

$$B = \frac{N}{m-1} \sum_{i=1}^{m} \left( \bar{\psi}_i - \bar{\psi} \right)^2. \qquad (16.18)$$

Notice that, in (16.18), B/N is the variance of the means $\bar{\psi}_i$. The within-chain
variance W is defined as the average of the variances $s_i^2$ of the m chains,

$$s_i^2 = \frac{1}{N-1} \sum_{j=1}^{N} \left( \psi_{ij} - \bar{\psi}_i \right)^2, \qquad W = \frac{1}{m} \sum_{i=1}^{m} s_i^2. \qquad (16.19)$$

The quantity $\hat{\sigma}^2$, defined as

$$\hat{\sigma}^2 = \frac{N-1}{N} W + \frac{1}{N} B, \qquad (16.20)$$

is intended to be an unbiased estimator of the variance of the parameter $\psi$ under the
hypothesis that the stationary distribution is being sampled. At the beginning of a
chain, before the stationary distribution is reached, $\hat{\sigma}^2$ overestimates the variance,
because of the different initial starting points. It was suggested by Brooks and
Gelman [6] to add an additional term to this estimate of the variance, to account
for the variability in the estimate of the means, so that the estimate of the
variance to use becomes

$$\hat{V} = \hat{\sigma}^2 + \frac{B}{mN}. \qquad (16.21)$$

Convergence can be monitored by use of the following statistic:

$$\sqrt{\hat{R}} \equiv \sqrt{\frac{\hat{V}}{W}}, \qquad (16.22)$$

which should converge to 1 when the stationary distribution in all chains has been
reached. A common use of this statistic is to repeat the calculation of the Gelman-Rubin
statistic after excising an increasingly longer initial portion of the chain, until
approximately

$$\sqrt{\hat{R}} < 1.2. \qquad (16.23)$$

A procedure to test for convergence of the chain using the Gelman-Rubin
statistic is to divide the chain into segments of length b, such that the N iterations
are divided into N/b batches. For each segment starting at iteration $i = k \times b$, $k =
0, \ldots, N/b - 1$, we can calculate the value $\hat{R}$ and claim convergence of the chains
when (16.23) is satisfied. Figure 16.7 shows results of this test run on m = 2 chains


Fig. 16.7 Gelman-Rubin statistic $\hat{R}$ calculated from m = 2 chains with the same distributions as
in Fig. 16.3, one starting at a = 12, b = 6, and the other at a = 300 and b = 300. The chain
rapidly converges to its stationary distribution, and appears to forget about the starting point after
approximately 500 iterations. The values of $\hat{R}$ were calculated in segments of length b = 200,
starting at iteration i = 0

Fig. 16.8 Plot of the average of the parameters a and b for the two chains used in Fig. 16.7 (top
lines are for parameter a, bottom lines for b). For both parameters, the two chains sample the same
posterior distribution after approximately 500 iterations


based on the data of Table 10.1, starting at different values: one chain starting at
a value that is close to the posterior mean of the parameters, and one starting at
values that were intentionally chosen to be much larger than the parent values. After
approximately 500 iterations, the within-chain and between-chain estimates of
the variance become comparable, and the value of $\hat{R}$ approaches one.

Another related tool that aids in assessing convergence is the plot of the mean
of the parameters, shown in Fig. 16.8: it takes approximately 500 iterations for the
chain starting with high initial values to begin sampling the stationary distribution.
It is clear that, from that point on, both chains hover around similar values of the
parameters. One should also check that, individually, both $\hat{V}$ and W stabilize to
a common value as a function of the starting point of the batch, and not just their ratio $\hat{R}$.
In fact, under the hypothesis of convergence, both within-chain and between-chain
variances should converge to a common value.

Similar procedures to monitor convergence using the Gelman-Rubin statistic may instead
use batches of increasing length, starting from one of length b, and increasing to 2b,
etc., optionally discarding the first half of each batch. Moreover, thinning can be
implemented when calculating means, to reduce the effect of correlation among the
samples. In all cases, the goal is to show that eventually the value of $\hat{R}$ stabilizes
around unity.
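The batch procedure can be illustrated with a short Python sketch. It assumes the usual Gelman-Rubin definitions (B/N as the sample variance of the chain means, W as the average of the within-chain sample variances, and the Brooks-Gelman term B/(mN) added to the pooled variance); the function name `gelman_rubin` and the two synthetic chains are hypothetical and are not the chains of Fig. 16.7.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    """Return (sqrt_R, V, W) for m chains of equal length N, one parameter."""
    m = len(chains)
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    # B/N is the sample variance of the m chain means.
    b = n * statistics.variance(means)
    # W is the average of the m within-chain sample variances.
    w = statistics.fmean([statistics.variance(c) for c in chains])
    sigma2 = (n - 1) / n * w + b / n
    # Additional term for the variability in the estimate of the means.
    v = sigma2 + b / (m * n)
    return math.sqrt(v / w), v, w


rng = random.Random(2)
n_iter, batch = 2000, 200
# Hypothetical chains: one starts near the target, one starts far away and relaxes.
chain_1 = [rng.gauss(0.0, 1.0) for _ in range(n_iter)]
chain_2 = [50.0 * math.exp(-i / 100.0) + rng.gauss(0.0, 1.0) for i in range(n_iter)]

# One value of sqrt(R) per batch of length `batch`, starting at i = k * batch.
sqrt_r = [gelman_rubin([chain_1[k:k + batch], chain_2[k:k + batch]])[0]
          for k in range(0, n_iter, batch)]
print([round(x, 2) for x in sqrt_r])
```

Returning V and W along with their ratio makes it easy to check that each of them stabilizes individually, as recommended above.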


16.5.3  The  Raftery-Lewis  Test 

An ideal test for the convergence of MCMCs is one that determines the length of the
burn-in period, and how long the chain should be run to achieve a given precision in
the estimate of the model parameters. The Raftery-Lewis test provides estimates of
both quantities, based on just a short test run of the chain. The test was developed by
Raftery and Lewis [37], and it uses the comparison of the short sample chain with an
uncorrelated chain to make inferences on the convergence properties of the chain.
In this section we describe the application of the test and refer the reader interested
in its justification to [37].

The starting point to use the Raftery-Lewis test is to determine what inferences
we want to make from the Markov chain. Typically we want to estimate confidence
intervals at a given significance for each parameter, which means we need to
estimate two values $\theta_{min}$ and $\theta_{max}$ for each parameter $\theta$ such that their interval
contains a probability $1 - q$ (e.g., respectively, q = 0.32, 0.10 or 0.01 for confidence
level 68, 90 or 99 %),

$$1 - q = \int_{\theta_{min}}^{\theta_{max}} \pi(\theta)\, d\theta.$$

Consider, for example, the case of a 90 % confidence interval: the two parameter
values $\theta_{min}$ and $\theta_{max}$ are respectively the q = 0.95 and the q = 0.05 quantiles, so
that the interval $(\theta_{min}, \theta_{max})$ will contain 90 % of the posterior probability for that
parameter.
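In practice the two limits are read off the sorted links of the chain. The sketch below is one simple way to do so, assuming the burn-in has already been removed; the helper name `central_interval` is hypothetical, and it places a probability q/2 in each tail.

```python
import random


def central_interval(samples, q=0.10):
    """Interval holding a fraction 1 - q of the samples, with q/2 in each tail."""
    ordered = sorted(samples)
    n = len(ordered)
    # Order statistics that leave a fraction q/2 of the samples on each side.
    i_lo = round((q / 2) * n)
    i_hi = min(round((1 - q / 2) * n), n - 1)
    return ordered[i_lo], ordered[i_hi]


# Illustrative "chain": the integers 0..999 in random order.
values = list(range(1000))
random.Random(0).shuffle(values)
theta_min, theta_max = central_interval(values, q=0.10)
print(theta_min, theta_max)
```

Because the limits are order statistics of a finite sample, they are themselves random quantities, which is the point developed next.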

One can think of each quantile as a statistic, meaning that we can only
approximately estimate their values $\hat{\theta}_{min}$ and $\hat{\theta}_{max}$. The Raftery-Lewis test lets us
estimate any quantile $\hat{\theta}_q$ such that it satisfies $P(\theta \leq \hat{\theta}_q) = 1 - q$ to within $\pm r$, with
probability s (say 95 % probability, s = 0.95). We have therefore introduced two
additional probabilities, r and s, which should not be confused with the quantile q.
Consider, for example, that the requirement is to estimate the q = 0.05 quantile,
with a precision of r = 0.01 and a probability of achieving this precision of
s = 0.95. This corresponds to accepting that the 90 % confidence interval resulting
from such an estimate of the q = 0.05 quantile (and of the q = 0.95 as well) may in
reality be an 88 % or a 92 % confidence interval, 95 % of the time.

The Raftery-Lewis test uses the information provided by the sample chain,
together with the desired quantile q and the tolerances r and s, and returns the
number of burn-in iterations, and the required number of iterations N. A justification
for this test can be found in [37], and the test can be simply run using widely
available software such as the gibbsit code or the CODA software [28, 34]. Note
that the required number of iterations is a function of the quantile to be estimated,
with estimation of smaller quantiles typically requiring more iterations.
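To get a feeling for the numbers involved, one can compute how many links an uncorrelated chain would need. The sketch below is not the Raftery-Lewis test itself, which also accounts for the correlation of the chain and for the burn-in; it only applies the Gaussian approximation to the binomial distribution of the number of links below the quantile, and the function name is an assumption. A correlated chain will generally need more iterations than this.

```python
import math
from statistics import NormalDist


def min_independent_samples(q, r, s):
    """Samples needed to estimate a tail probability q to within +/- r,
    with probability s, if all samples were independent."""
    # Two-sided critical value of the standard Gaussian for probability s.
    z = NormalDist().inv_cdf((1.0 + s) / 2.0)
    # The binomial variance of the estimated probability is q (1 - q) / N.
    return math.ceil(z * z * q * (1.0 - q) / (r * r))


n_example = min_independent_samples(q=0.05, r=0.01, s=0.95)
print(n_example)
```

For the example in the text (q = 0.05, r = 0.01, s = 0.95) this gives a little over 1800 independent samples; the expression is symmetric in q and 1 - q, so the same number applies to the q = 0.95 quantile.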
Summary of Key Concepts for this Chapter

□ Monte Carlo Markov chain (MCMC): A numerical method to implement
a Markov chain, with the goal of estimating the posterior distribution of
model parameters via

$$P(\theta/Z) \propto P(\theta)\, \mathcal{L}.$$

□ Metropolis-Hastings algorithm: A commonly used method to draw and
accept or reject candidates for the MCMC. It is based on an acceptance
probability that simplifies to a ratio of likelihoods,

$$\alpha(\theta'/\theta_n) = \min \left\{ \frac{\mathcal{L}(\theta')}{\mathcal{L}(\theta_n)},\, 1 \right\}$$

when the priors and proposal distributions are uniform.

□ Gibbs sampler: An alternative algorithm to create an MCMC that
makes use of the full conditional distribution of each parameter.

□ Convergence tests: Tests to ensure that the MCMC is sampling the
intended posterior distribution. They typically require excising a burn-in
period during which the MCMC has not yet reached the stationary distribution.

□ Geweke z-score test: A simple test of convergence that makes use of
z-scores of two segments of the chain.

□  Gelman-Rubin  test :  A  convergence  test  that  requires  multiple  parallel 
chains  and  makes  use  of  between-chain  and  within-chain  variances. 

□  Raftery-Lewis  test :  A  convergence  test  that  compares  a  sample  of  the 
MCMC  to  an  uncorrelated  chain  to  determine  burn-in  time  and  required 
length  of  the  chain. 
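As a reminder of how the acceptance step works in practice, the following sketch implements the simplified Metropolis-Hastings rule for uniform priors and proposal distributions. It works with logarithms of the likelihood to avoid numerical underflow; the function name and arguments are illustrative assumptions.

```python
import math
import random


def mh_accept(log_like_candidate, log_like_current, rng=random):
    """Accept or reject a candidate; returns (accepted, alpha)."""
    log_ratio = log_like_candidate - log_like_current
    # alpha = min(L(candidate) / L(current), 1), evaluated from the logarithms.
    alpha = 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
    # The candidate is accepted when a uniform random number falls below alpha.
    return rng.random() < alpha, alpha


accepted_up, alpha_up = mh_accept(-1.0, -2.0)   # candidate has higher likelihood
_, alpha_down = mh_accept(-2.5, -2.0)           # candidate has lower likelihood
print(accepted_up, alpha_up, alpha_down)
```

A candidate with higher likelihood is always accepted, while one with lower likelihood is accepted with probability equal to the likelihood ratio.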


Problems 

16.1  Prove  that,  in  the  presence  of  positive  correlation  among  MCMC  samples,  the 
variance  of  the  sample  mean  is  larger  than  that  of  an  independent  chain. 

16.2 Using the data of log m and velocity from Table 8.1 of Hubble's experiment,
construct a Monte Carlo Markov chain for the fit to a linear model with 10,000
iterations. Use uniform distributions for the prior and proposal distributions of the
two model parameters a and b, the latter with widths of 0.1 and 0.02, respectively,
for a and b in the neighborhood of the current value. You can start your chain at
values of a = 0.2 and b = 0.9. After completion of the chain, plot the sample
distribution of the two model parameters.


16.3  A  one-parameter  chain  is  constructed  such  that  in  two  intervals  A  and  B  the 
following  values  are  accepted  into  the  chain: 


A: 10, 11, 13, 11, 10
B: 7, 8, 1, 11, 10, 8;

where  A  is  an  initial  interval,  and  B  an  interval  at  the  end  of  the  chain.  Not  knowing 
how  the  chain  was  constructed,  use  the  Geweke  z  score  to  determine  whether  the 
chain  might  have  converged. 

16.4  Using  the  data  of  Table  10.1,  construct  a  Monte  Carlo  Markov  chain  for  the 
parameters  of  the  linear  model,  with  10,000  iterations.  Use  uniform  distributions  for 
the  prior  and  proposal  distributions,  the  latter  with  a  width  of  10  for  both  parameters. 
Start the chain at a = 12 and b = 6. After completion of the chain, plot the sample
distribution  of  the  two  model  parameters. 

16.5  Consider  the  following  portions  of  two  one-parameter  chains,  run  in  parallel 
and  starting  from  different  initial  positions: 


7, 8, 1, 11, 10, 8
11, 11, 8, 10, 9, 12.


Using two segments of length b = 3, calculate the Gelman-Rubin statistic $\sqrt{\hat{R}}$ for
both segments under the hypothesis of uncorrelated samples.

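16.6 Consider the step-function model described in Example 16.2, and a dataset
consisting of n measurements. Assuming that the priors on the parameters $\lambda$, $\mu$ and
m are uniform, show that the full conditional distributions are given by

$$\pi_\lambda(\lambda) = G\left( \sum_{i=1}^{m} y_i + 1,\; m \right)$$

$$\pi_\mu(\mu) = G\left( \sum_{i=m+1}^{n} y_i + 1,\; n - m \right) \qquad (16.24)$$

$$\pi_m(m) = \frac{ e^{-m\lambda}\, \lambda^{\sum_{i=1}^{m} y_i}\; e^{-(n-m)\mu}\, \mu^{\sum_{i=m+1}^{n} y_i} }{ \sum_{l=1}^{n} e^{-l\lambda}\, \lambda^{\sum_{i=1}^{l} y_i}\; e^{-(n-l)\mu}\, \mu^{\sum_{i=l+1}^{n} y_i} }$$

where G represents the gamma distribution.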

16.7 Consider the step-function model described in Example 16.2, and a dataset
consisting of the following five measurements:

0, 1, 3, 4, 2.

Start a Metropolis-Hastings MCMC at $\lambda = 0$, $\mu = 2$ and m = 1, and use uniform
priors on all three parameters. Assume for simplicity that all parameters can only
be integer, and use uniform proposal distributions that span the ranges $\Delta\lambda = \pm 2$,
$\Delta\mu = \pm 2$ and $\Delta m = \pm 2$, and that the following numbers are drawn in the first
three iterations:

Iteration   $\Delta\lambda$   $\Delta\mu$   $\Delta m$   $\alpha$
1           +1                -1            +1           0.5
2           +1                +2            +1           0.7
3           -1                -2            +1           0.1

With this information, calculate the first four links of the Metropolis-Hastings
MCMC.

16.8 Consider a Monte Carlo Markov chain constructed with a Metropolis-Hastings
algorithm, using uniform prior and proposal distributions. At a given
iteration, the chain is at the point of maximum likelihood or, equivalently, minimum
$\chi^2$. Calculate the probability of acceptance of a candidate that has, respectively,
$\Delta\chi^2 = 1$, 2, and 10.

A.1 The Gaussian Distribution and the Error Function


The Gaussian distribution (3.11) is defined as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \qquad (A.1)$$

The maximum value is obtained at $x = \mu$, and the value of x where the Gaussian is
a times the peak value is given by

$$z \equiv \frac{x - \mu}{\sigma} = \sqrt{-2 \ln a}. \qquad (A.2)$$


Figure A.1 shows a standard Gaussian normalized to its peak value, and values of a
times the peak value are tabulated in Table A.1. The Half Width at Half Maximum
(HWHM) has a value of approximately $1.18\sigma$.

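The error function is defined in (3.13) as

$$\mathrm{erf}\, z = \frac{1}{\sqrt{\pi}} \int_{-z}^{z} e^{-x^2}\, dx \qquad (A.3)$$

and it is related to the integral of the Gaussian distribution defined in (3.12),

$$A(z) = \int_{\mu - z\sigma}^{\mu + z\sigma} f(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-z}^{z} e^{-\frac{x^2}{2}}\, dx. \qquad (A.4)$$

The relationship between the two integrals is given by

$$\mathrm{erf}\left( \frac{z}{\sqrt{2}} \right) = A(z). \qquad (A.5)$$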
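Relations (A.2) and (A.5) are easy to evaluate numerically when a value falls between tabulated entries. The following Python sketch uses only the standard library; the function names are illustrative.

```python
import math


def gaussian_integral_a(z):
    """A(z): probability of a Gaussian variable within z standard deviations
    of the mean, from A(z) = erf(z / sqrt(2))."""
    return math.erf(z / math.sqrt(2.0))


def z_at_fraction(a):
    """Value of z at which the Gaussian is a times its peak value, 0 < a <= 1."""
    return math.sqrt(-2.0 * math.log(a))


print(gaussian_integral_a(1.0))  # probability within one standard deviation
print(z_at_fraction(0.5))        # half width at half maximum, in units of sigma
```

The second call gives the HWHM of approximately 1.18 standard deviations quoted above.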
Appendix: Numerical Tables

Fig. A.1 Normalized values of the probability distribution function of a standard Gaussian ($\mu = 0$
and $\sigma = 1$)


Table A.1 Values of a times the peak value for a Gaussian distribution

a      z      | a      z      | a      z      | a      z      | a      z
0.980  0.201  | 0.960  0.286  | 0.940  0.352  | 0.920  0.408  | 0.900  0.459
0.880  0.506  | 0.860  0.549  | 0.840  0.591  | 0.820  0.630  | 0.800  0.668
0.780  0.705  | 0.760  0.741  | 0.740  0.776  | 0.720  0.811  | 0.700  0.845
0.680  0.878  | 0.660  0.912  | 0.640  0.945  | 0.620  0.978  | 0.600  1.011
0.580  1.044  | 0.560  1.077  | 0.540  1.110  | 0.520  1.144  | 0.500  1.177
0.480  1.212  | 0.460  1.246  | 0.440  1.281  | 0.420  1.317  | 0.400  1.354
0.380  1.391  | 0.360  1.429  | 0.340  1.469  | 0.320  1.510  | 0.300  1.552
0.280  1.596  | 0.260  1.641  | 0.240  1.689  | 0.220  1.740  | 0.200  1.794
0.180  1.852  | 0.160  1.914  | 0.140  1.983  | 0.120  2.059  | 0.100  2.146
0.080  2.248  | 0.060  2.372  | 0.040  2.537  | 0.020  2.797  | 0.010  3.035

The function A(z) describes the integrated probability of a Gaussian distribution
to have values between $\mu - z\sigma$ and $\mu + z\sigma$. The number z therefore represents the
number of $\sigma$ by which the interval extends in each direction. The function A(z) is
tabulated in Table A.2, where each number in the table corresponds to a number z
given by the number in the left column (e.g., 0.0, 0.1, etc.), and for which the second
decimal digit is given by the number in the top row (e.g., the value of 0.007979
corresponds to z = 0.01).

The cumulative distribution of a standard Gaussian function was defined in (3.14) as

$$B(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}}\, dt \qquad (A.6)$$

and it is therefore related to the integral A(z) by

$$B(z) = \frac{1}{2} + \frac{A(z)}{2}. \qquad (A.7)$$

Table A.2 Values of the integral A(z) as a function of z, the number of standard errors $\sigma$


The values of B(z) are tabulated in Table A.3. Each number in the table corresponds
to a number z given by the number in the left column (e.g., 0.0, 0.1, etc.), and for
which the second decimal digit is given by the number in the top row (e.g., the
value of 0.503990 corresponds to z = 0.01).

Critical values of the standard Gaussian distribution functions corresponding to
selected values of the integrals A(z) and B(z) are shown in Table A.4. They indicate
the value of the variable z required to include a given probability, and are useful for
either two-sided or one-sided rejection regions in hypothesis testing.


A.2  Upper  and  Lower  Limits  for  a  Poisson  Distribution 


The Gehrels approximation described in [16] can be used to calculate upper
and lower limits for a Poisson distribution, when $n_{obs}$ counts are recorded. The
confidence level is described by the parameter S, corresponding to the number of
standard deviations $\sigma$ for a Gaussian distribution; for example, S = 1 corresponds
to an 84.1 % confidence level, S = 2 to 97.7 %, and S = 3 to 99.9 %;
see Table 5.2 for the correspondence between values of S and probability. The upper
and lower limits are described, in the simplest approximation, by

$$\lambda_{up} = n_{obs} + \frac{S^2 + 3}{4} + S \sqrt{n_{obs} + \frac{3}{4}}$$

$$\lambda_{lo} = n_{obs} \left( 1 - \frac{1}{9\, n_{obs}} - \frac{S}{3 \sqrt{n_{obs}}} \right)^3 \qquad (A.8)$$

and more accurate approximations are provided in [16] (Tables A.5 and A.6).
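For values of $n_{obs}$ or S that are not tabulated, the two limits can be evaluated directly. The Python sketch below assumes the standard form of the simplest Gehrels approximation, an upper limit $n_{obs} + (S^2+3)/4 + S\sqrt{n_{obs} + 3/4}$ and a lower limit $n_{obs}\,(1 - 1/(9 n_{obs}) - S/(3\sqrt{n_{obs}}))^3$; the function names are illustrative.

```python
import math


def gehrels_upper(n_obs, s):
    """Approximate Poisson upper limit for n_obs counts at S Gaussian sigmas."""
    return n_obs + (s * s + 3.0) / 4.0 + s * math.sqrt(n_obs + 0.75)


def gehrels_lower(n_obs, s):
    """Approximate Poisson lower limit; requires n_obs >= 1."""
    return n_obs * (1.0 - 1.0 / (9.0 * n_obs) - s / (3.0 * math.sqrt(n_obs))) ** 3


print(gehrels_upper(0, 1), gehrels_upper(100, 3))
print(gehrels_lower(10, 1), gehrels_lower(100, 3))
```

These calls reproduce the corresponding entries of Tables A.5 and A.6 to the two decimal digits shown there.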


A.3 The $\chi^2$ Distribution


The probability distribution function for a $\chi^2$ variable is defined in (7.11) as

$$f_{\chi^2}(z) = \left( \frac{1}{2} \right)^{f/2} \frac{1}{\Gamma(f/2)}\, e^{-\frac{z}{2}}\, z^{\frac{f}{2} - 1},$$


Table A.4 Critical values of the standard Gaussian distribution to include a given probability,
for two-sided confidence intervals (-z, z) of the integral A(z), and for one-sided
intervals (-∞, z) of the integral B(z)

Probability   Two-sided z   One-sided z
0.01          0.013         -2.326
0.05          0.063         -1.645
0.10          0.126         -1.282
0.20          0.253         -0.842
0.30          0.385         -0.524
0.40          0.524         -0.253
0.50          0.674          0.000
0.60          0.842          0.253
0.70          1.036          0.524
0.80          1.282          0.842
0.90          1.645          1.282
0.95          1.960          1.645
0.99          2.576          2.326
0.999         3.290          3.090
0.9999        3.890          3.718

Table A.5 Selected upper limits for a Poisson variable using the Gehrels approximation

Upper limits
           Poisson parameter S or confidence level
           S = 1               S = 2               S = 3
n_obs      (1-σ, or 84.1 %)    (2-σ, or 97.7 %)    (3-σ, or 99.9 %)
0          1.87                3.48                5.60
1          3.32                5.40                7.97
2          4.66                7.07                9.97
3          5.94                8.62                11.81
4          7.18                10.11               13.54
5          8.40                11.55               15.19
6          9.60                12.95               16.79
7          10.78               14.32               18.35
8          11.96               15.67               19.87
9          13.12               16.99               21.37
10         14.28               18.31               22.84
20         25.56               30.86               36.67
30         36.55               42.84               49.64
40         47.38               54.52               62.15
50         58.12               66.00               74.37
60         68.79               77.34               86.38
70         79.41               88.57               98.23
80         89.99               99.72               109.96
90         100.53              110.80              121.58
100        111.04              121.82              133.11

Table A.6 Selected lower limits for a Poisson variable using the Gehrels approximation

Lower limits
           Poisson parameter S or confidence level
           S = 1               S = 2               S = 3
n_obs      (1-σ, or 84.1 %)    (2-σ, or 97.7 %)    (3-σ, or 99.9 %)
1          0.17                0.01                0.00
2          0.71                0.21                0.03
3          1.37                0.58                0.17
4          2.09                1.04                0.42
5          2.85                1.57                0.75
6          3.63                2.14                1.13
7          4.42                2.75                1.56
8          5.24                3.38                2.02
9          6.06                4.04                2.52
10         6.90                4.71                3.04
20         15.57               12.08               9.16
30         24.56               20.07               16.16
40         33.70               28.37               23.63
50         42.96               36.88               31.40
60         52.28               45.53               39.38
70         61.66               54.28               47.52
80         71.08               63.13               55.79
90         80.53               72.04               64.17
100        90.02               81.01               72.63

where f is the number of degrees of freedom. The critical value or p-quantile of the
distribution is given by

$$P_{\chi^2}(z \leq \chi^2_{crit}) = \int_0^{\chi^2_{crit}} f_{\chi^2}(z)\, dz = p \qquad (A.9)$$

or, equivalently,

$$P_{\chi^2}(z \geq \chi^2_{crit}) = \int_{\chi^2_{crit}}^{\infty} f_{\chi^2}(z)\, dz = 1 - p. \qquad (A.10)$$

The critical value is a function of the number of degrees of freedom f and the level
of probability p. Normally p is intended as a large probability, such as 0.68, 0.90, or
0.99, meaning that there is just a 32, 10, or 1 % probability to have values higher
than the critical value $\chi^2_{crit}$.

As described in Sect. 7.2, the $\chi^2$ distribution has the following mean and
variance:

$$\mu = f, \qquad \sigma^2 = 2f.$$



It is convenient to tabulate the value of reduced $\chi^2$, or $\chi^2_{crit}/f$, that corresponds to
a given probability level, as a function of the number of degrees of freedom. Selected
critical values of the $\chi^2$ distribution are reported in Table A.7. When using this table,
remember to multiply the tabulated reduced $\chi^2$ by the number of degrees of freedom
f to obtain the value of $\chi^2$.

If Z is a $\chi^2$-distributed variable with f degrees of freedom,

$$\lim_{f \to \infty} \frac{Z - f}{\sqrt{2f}} = N(0, 1). \qquad (A.11)$$

In fact, a $\chi^2$ variable is obtained as the sum of independent variables (Sect. 7.2),
to which the central limit theorem applies (Sect. 4.3). For a large number of degrees
of freedom, the standard Gaussian distribution can be used to supplement Table A.7
according to (A.11). For example, for p = 0.99, the one-sided critical value of the
standard Gaussian is approximately 2.326, according to Table A.4. Using this value
in (A.11) for f = 200 would give a critical value for the reduced $\chi^2$ distribution of
$1 + 2.326 \sqrt{2/200} = 1.2326$ (compare to 1.247 from Table A.7). The values for $f = \infty$
in Table A.7 are obtained using the Gaussian approximation, according to (A.11).
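The Gaussian approximation for large f can be written as a one-line function. This sketch assumes that (A.11) states that $(Z - f)/\sqrt{2f}$ tends to a standard Gaussian, so that the reduced critical value is approximately $1 + z_p \sqrt{2/f}$, with $z_p$ the one-sided Gaussian critical value; the function name is illustrative.

```python
import math
from statistics import NormalDist


def reduced_chi2_crit_gaussian(f, p):
    """Approximate critical value of the reduced chi-squared for large f."""
    # One-sided critical value of the standard Gaussian for probability p.
    z_p = NormalDist().inv_cdf(p)
    return 1.0 + z_p * math.sqrt(2.0 / f)


approx_200 = reduced_chi2_crit_gaussian(200, 0.99)
print(approx_200)
```

For f = 200 and p = 0.99 the approximation falls about 1 % below the tabulated value of 1.247, which gives a sense of its accuracy at that number of degrees of freedom.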


A.4  The  F  Distribution 


The F distribution with $f_1$, $f_2$ degrees of freedom is defined in (7.22) as

$$f_F(z) = \frac{\Gamma\left( \frac{f_1 + f_2}{2} \right)}{\Gamma\left( \frac{f_1}{2} \right) \Gamma\left( \frac{f_2}{2} \right)} \left( \frac{f_1}{f_2} \right)^{\frac{f_1}{2}} \frac{z^{\frac{f_1}{2} - 1}}{\left( 1 + z\, \frac{f_1}{f_2} \right)^{\frac{f_1 + f_2}{2}}}.$$

The critical value $F_{crit}$ that includes a probability p is given by

$$P(z \geq F_{crit}) = \int_{F_{crit}}^{\infty} f_F(z)\, dz = 1 - p, \qquad (A.12)$$

and it is a function of the degrees of freedom $f_1$ and $f_2$. Table A.8 reports
the critical values for various probability levels p, for a fixed value $f_1 = 1$, and as
a function of $f_2$. Tables A.9, A.10, A.11, A.12, A.13, A.14, and A.15 have the critical
values as a function of both $f_1$ and $f_2$.

Asymptotic values when $f_1$ and $f_2$ approach infinity can be found using (7.25),
reported here for convenience:

$$\lim_{f_2 \to \infty} f_F(z, f_1, f_2) = f_{\chi^2}(x, f_1) \quad \text{where } x = f_1 z$$

$$\lim_{f_1 \to \infty} f_F(z, f_1, f_2) = f_{\chi^2}(x, f_2) \quad \text{where } x = f_2 / z.$$

Table A.7 (continued)

Probability p to have a value of reduced $\chi^2$ below the critical value

f    0.01   0.05   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   0.95   0.99
60   0.625  0.720  0.774  0.844  0.897  0.944  0.989  1.036  1.087  1.150  1.240  1.318  1.473

Table A.8 Critical values of F statistics for $f_1 = 1$ degrees of freedom

Probability p to have a value of F below the critical value

f_2    0.50    0.60    0.70    0.80    0.90     0.95      0.99
1      1.000   1.894   3.852   9.472   39.863   161.448   4052.182
2      0.667   1.125   1.922   3.556   8.526    18.513    98.503
3      0.585   0.957   1.562   2.682   5.538    10.128    34.116
4      0.549   0.885   1.415   2.351   4.545    7.709     21.198
5      0.528   0.846   1.336   2.178   4.060    6.608     16.258
6      0.515   0.820   1.286   2.073   3.776    5.987     13.745
7      0.506   0.803   1.253   2.002   3.589    5.591     12.246
8      0.499   0.790   1.228   1.951   3.458    5.318     11.259
9      0.494   0.780   1.209   1.913   3.360    5.117     10.561
10     0.490   0.773   1.195   1.883   3.285    4.965     10.044
20     0.472   0.740   1.132   1.757   2.975    4.351     8.096
30     0.466   0.729   1.112   1.717   2.881    4.171     7.562
40     0.463   0.724   1.103   1.698   2.835    4.085     7.314
50     0.462   0.721   1.097   1.687   2.809    4.034     7.171
60     0.461   0.719   1.093   1.679   2.791    4.001     7.077
70     0.460   0.717   1.090   1.674   2.779    3.978     7.011
80     0.459   0.716   1.088   1.670   2.769    3.960     6.963
90     0.459   0.715   1.087   1.667   2.762    3.947     6.925
100    0.458   0.714   1.085   1.664   2.756    3.936     6.895
200    0.457   0.711   1.080   1.653   2.731    3.888     6.763
∞      0.455   0.708   1.074   1.642   2.706    3.842     6.635

For example, the critical values of the F distribution for $f_1 = 1$ and in the limit of
large $f_2$ are obtained from the first row of Table A.7.
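This limit can also be checked numerically without a $\chi^2$ table. A $\chi^2$ variable with one degree of freedom is the square of a standard Gaussian variable, so its critical value at probability p is the square of the two-sided Gaussian critical value of Table A.4. The sketch below uses this fact; the function name is illustrative.

```python
from statistics import NormalDist


def f_crit_f1_one_large_f2(p):
    """Critical value of F for f1 = 1 in the limit of large f2."""
    # Two-sided critical value z such that A(z) = p for a standard Gaussian.
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    # Chi-squared with one degree of freedom is the square of a Gaussian variable.
    return z * z


for p in (0.50, 0.95, 0.99):
    print(p, round(f_crit_f1_one_large_f2(p), 3))
```

The results agree with the last row of Table A.8 (0.455, 3.842, and 6.635 for p = 0.50, 0.95, and 0.99).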


A.5  The  Student’s  t  Distribution 


The Student t distribution is given by (7.34),

$$f_T(t) = \frac{1}{\sqrt{f \pi}}\, \frac{\Gamma((f+1)/2)}{\Gamma(f/2)} \times \left( 1 + \frac{t^2}{f} \right)^{-\frac{1}{2}(f+1)}$$

where f is the number of degrees of freedom. The probability p that the absolute
value of a t variable exceeds a critical value $T_{crit}$ is given by

$$P(|t| \leq T_{crit}) = P(|\bar{x} - \mu| \leq T_{crit} \cdot s / \sqrt{n}) = \int_{-T_{crit}}^{T_{crit}} f_T(t)\, dt = 1 - p. \qquad (A.13)$$

Table A.9 Critical values of F statistic that include p = 0.50 probability

                                        f_1
f_2    2       4       6       8       10      20      40      60      80      100
1      1.500   1.823   1.942   2.004   2.042   2.119   2.158   2.172   2.178   2.182
2      1.000   1.207   1.282   1.321   1.345   1.393   1.418   1.426   1.430   1.433
3      0.881   1.063   1.129   1.163   1.183   1.225   1.246   1.254   1.257   1.259
4      0.828   1.000   1.062   1.093   1.113   1.152   1.172   1.178   1.182   1.184
5      0.799   0.965   1.024   1.055   1.073   1.111   1.130   1.136   1.139   1.141
6      0.780   0.942   1.000   1.030   1.048   1.084   1.103   1.109   1.113   1.114
7      0.767   0.926   0.983   1.013   1.030   1.066   1.085   1.091   1.094   1.096
8      0.757   0.915   0.971   1.000   1.017   1.053   1.071   1.077   1.080   1.082
9      0.749   0.906   0.962   0.990   1.008   1.043   1.061   1.067   1.070   1.072
10     0.743   0.899   0.954   0.983   1.000   1.035   1.053   1.059   1.062   1.063
20     0.718   0.868   0.922   0.950   0.966   1.000   1.017   1.023   1.026   1.027
30     0.709   0.858   0.912   0.939   0.955   0.989   1.006   1.011   1.014   1.016
40     0.705   0.854   0.907   0.934   0.950   0.983   1.000   1.006   1.008   1.010
50     0.703   0.851   0.903   0.930   0.947   0.980   0.997   1.002   1.005   1.007
60     0.701   0.849   0.901   0.928   0.945   0.978   0.994   1.000   1.003   1.004
70     0.700   0.847   0.900   0.927   0.943   0.976   0.993   0.998   1.001   1.003
80     0.699   0.846   0.899   0.926   0.942   0.975   0.992   0.997   1.000   1.002
90     0.699   0.845   0.898   0.925   0.941   0.974   0.991   0.996   0.999   1.001
100    0.698   0.845   0.897   0.924   0.940   0.973   0.990   0.996   0.998   1.000
200    0.696   0.842   0.894   0.921   0.937   0.970   0.987   0.992   0.995   0.997
∞      0.693   0.839   0.891   0.918   0.934   0.967   0.983   0.989   0.992   0.993

These two-sided critical values are tabulated in Tables A.16, A.17, A.18, A.19, A.20,
A.21, and A.22 for selected values of f, as a function of the critical value $T_{crit}$. In these
tables, the left column indicates the value of $T_{crit}$ to the first decimal digit, and the
values in the top row are the second decimal digit.

Table A.23 provides a comparison of the probability p for five critical values,
$T_{crit} = 1$ through 5, as a function of f. The case of $f = \infty$ corresponds to a standard
Gaussian.


A.6  The  Linear  Correlation  Coefficient  r 

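The linear correlation coefficient is defined as

$$r^2 = \frac{\left( N \sum x_i y_i - \sum x_i \sum y_i \right)^2}{\left( N \sum x_i^2 - \left( \sum x_i \right)^2 \right) \left( N \sum y_i^2 - \left( \sum y_i \right)^2 \right)} \qquad (A.14)$$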


Table A.10 Critical values of F statistic that include p = 0.60 probability

                                        f_1
f_2    2       4       6       8       10      20      40      60      80      100
1      2.625   3.093   3.266   3.355   3.410   3.522   3.579   3.598   3.608   3.613
2      1.500   1.718   1.796   1.835   1.859   1.908   1.933   1.941   1.945   1.948
3      1.263   1.432   1.489   1.518   1.535   1.570   1.588   1.593   1.596   1.598
4      1.162   1.310   1.359   1.383   1.397   1.425   1.439   1.444   1.446   1.448
5      1.107   1.243   1.287   1.308   1.320   1.345   1.356   1.360   1.362   1.363
6      1.072   1.200   1.241   1.260   1.272   1.293   1.303   1.307   1.308   1.309
7      1.047   1.171   1.209   1.227   1.238   1.257   1.266   1.269   1.270   1.271
8      1.030   1.150   1.186   1.203   1.213   1.231   1.239   1.241   1.242   1.243
9      1.016   1.133   1.168   1.185   1.194   1.210   1.217   1.219   1.220   1.221
10     1.006   1.120   1.154   1.170   1.179   1.194   1.200   1.202   1.203   1.204
20     0.960   1.064   1.093   1.106   1.112   1.122   1.124   1.124   1.124   1.124
30     0.945   1.046   1.074   1.085   1.090   1.097   1.097   1.097   1.096   1.096
40     0.938   1.037   1.064   1.075   1.080   1.085   1.084   1.083   1.082   1.081
50     0.933   1.032   1.058   1.068   1.073   1.078   1.076   1.074   1.073   1.072
60     0.930   1.029   1.054   1.064   1.069   1.073   1.070   1.068   1.066   1.065
70     0.928   1.026   1.052   1.061   1.066   1.069   1.066   1.064   1.062   1.061
80     0.927   1.024   1.049   1.059   1.064   1.067   1.063   1.060   1.059   1.057
90     0.926   1.023   1.048   1.057   1.062   1.065   1.061   1.058   1.056   1.054
100    0.925   1.021   1.047   1.056   1.060   1.063   1.059   1.056   1.054   1.052
200    0.921   1.016   1.041   1.050   1.054   1.055   1.050   1.046   1.043   1.041
∞      0.916   1.011   1.035   1.044   1.047   1.048   1.041   1.036   1.032   1.029

and it is equal to the product bb', where b is the best-fit slope of the linear regression
of Y on X, and b' is the slope of the linear regression of X on Y. The probability
distribution function of r, under the hypothesis that the variables X and Y are not
correlated, is given by

f_r(r) = \frac{1}{\sqrt{\pi}} \frac{\Gamma\left(\frac{f+1}{2}\right)}{\Gamma\left(\frac{f}{2}\right)} \left(1 - r^2\right)^{\frac{f-2}{2}}    (A.15)

where N is the size of the sample, and f = N − 2 is the effective number of degrees
of freedom of the dataset.
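
The statement that the square of the linear correlation coefficient equals the product bb' of the two regression slopes can be checked numerically. The sketch below is illustrative only: the function names and the sample data are hypothetical, and the slopes are the usual unweighted least-squares slopes of Y on X and of X on Y.

```python
def _sums(x, y):
    """Return N and the sums needed for the slopes and for r^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return n, sx, sy, sxy, sxx, syy

def correlation_squared(x, y):
    """Square of the linear correlation coefficient from the sample sums."""
    n, sx, sy, sxy, sxx, syy = _sums(x, y)
    num = (n * sxy - sx * sy) ** 2
    den = (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    return num / den

def regression_slopes(x, y):
    """Least-squares slope b of Y on X and slope b' of X on Y."""
    n, sx, sy, sxy, sxx, syy = _sums(x, y)
    cov = n * sxy - sx * sy
    return cov / (n * sxx - sx ** 2), cov / (n * syy - sy ** 2)

if __name__ == "__main__":
    # Hypothetical data, chosen only to exercise the functions
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    b, b_prime = regression_slopes(x, y)
    print(correlation_squared(x, y), b * b_prime)
```

The two printed numbers should agree to rounding error, since both are built from the same sums.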

In Table A.24 we report the critical values of r calculated from the following
equation,

1 - p = \int_{-r_{crit}}^{r_{crit}} f_r(r)\, dr    (A.16)
Table A.11 Critical values of F statistic that include p = 0.70 probability

f2\f1        2       4       6       8      10      20      40      60      80     100
1        5.056   5.830   6.117   6.267   6.358   6.544   6.639   6.671   6.687   6.697
2        2.333   2.561   2.640   2.681   2.705   2.754   2.779   2.787   2.791   2.794
3        1.847   1.985   2.028   2.048   2.061   2.084   2.096   2.100   2.102   2.103
4        1.651   1.753   1.781   1.793   1.800   1.812   1.818   1.819   1.820   1.821
5        1.547   1.629   1.648   1.656   1.659   1.665   1.666   1.666   1.667   1.667
6        1.481   1.551   1.565   1.570   1.571   1.572   1.570   1.570   1.569   1.569
7        1.437   1.499   1.509   1.511   1.511   1.507   1.504   1.502   1.501   1.501
8        1.405   1.460   1.467   1.468   1.466   1.460   1.455   1.452   1.451   1.450
9        1.380   1.431   1.436   1.435   1.433   1.424   1.417   1.414   1.413   1.412
10       1.361   1.408   1.412   1.409   1.406   1.395   1.387   1.384   1.382   1.381
20       1.279   1.311   1.305   1.297   1.290   1.268   1.252   1.245   1.242   1.240
30       1.254   1.280   1.271   1.261   1.253   1.226   1.206   1.197   1.192   1.189
40       1.241   1.264   1.255   1.243   1.234   1.205   1.182   1.172   1.167   1.163
50       1.233   1.255   1.245   1.233   1.223   1.192   1.167   1.156   1.150   1.146
60       1.228   1.249   1.238   1.226   1.215   1.183   1.157   1.146   1.139   1.135
70       1.225   1.245   1.233   1.221   1.210   1.177   1.150   1.138   1.131   1.127
80       1.222   1.242   1.230   1.217   1.206   1.172   1.144   1.132   1.125   1.120
90       1.220   1.239   1.227   1.214   1.203   1.168   1.140   1.127   1.120   1.115
100      1.219   1.237   1.225   1.212   1.200   1.165   1.137   1.123   1.116   1.111
200      1.211   1.228   1.215   1.201   1.189   1.152   1.121   1.106   1.097   1.091
∞        1.204   1.220   1.205   1.191   1.178   1.139   1.104   1.087   1.076   1.069


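where p is the probability for a given value of the correlation coefficient to exceed, in
absolute value, the critical value r_crit. The critical values are a function of the number
of degrees of freedom f and of the probability p. Note that the column headings of
Table A.24 give the probability for |r| to fall below r_crit, i.e., the value of the
integral in (A.16).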
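
As a cross-check of Table A.24, critical values of r can be obtained by integrating the distribution of r for uncorrelated variables and solving for the limit of integration. The sketch below is one possible approach under stated assumptions: it uses the density f_r(r) proportional to (1 − r²)^((f−2)/2), normalized with the ratio of Gamma functions, and it takes as input the probability for |r| to fall below the critical value (the convention of the table headings). The function names, Simpson's rule, and the bisection tolerance are choices made for this example.

```python
import math

def r_pdf(r, f):
    """Density of r for uncorrelated variables, f = N - 2 degrees of freedom."""
    log_norm = (math.lgamma((f + 1) / 2) - math.lgamma(f / 2)
                - 0.5 * math.log(math.pi))
    return math.exp(log_norm) * (1.0 - r * r) ** ((f - 2) / 2)

def prob_below(r_crit, f, steps=2000):
    """P(|r| < r_crit) by Simpson's rule on [0, r_crit], doubled by symmetry."""
    h = r_crit / steps
    total = r_pdf(0.0, f) + r_pdf(r_crit, f)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * r_pdf(k * h, f)
    return 2.0 * total * h / 3.0

def critical_r(f, p_below, tol=1e-7):
    """Bisection for r_crit such that P(|r| < r_crit) = p_below (f >= 2)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_below(mid, f) < p_below:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for f in (2, 10, 100):
        print(f, [round(critical_r(f, p), 3) for p in (0.50, 0.90, 0.95, 0.99)])
```

For f = 2 the density is constant, so the critical value equals the probability itself, which gives a simple sanity check.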

To evaluate the probability distribution function in the case of large f, a
convenient approximation can be given using the asymptotic expansion for the
Gamma function (see [1]):

\Gamma(az + b) \simeq \sqrt{2\pi}\, e^{-az} (az)^{az + b - 1/2}.    (A.17)

For large values of f, the ratio of the Gamma functions can therefore be approximated as

\frac{\Gamma\left(\frac{f+1}{2}\right)}{\Gamma\left(\frac{f}{2}\right)} \simeq \sqrt{\frac{f}{2}}.
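
The quality of the large-f approximation for the ratio of Gamma functions, Γ((f+1)/2)/Γ(f/2) ≃ √(f/2), which follows from the asymptotic expansion with a = 1/2 and z = f, can be inspected numerically. This is a small sketch with hypothetical function names; it relies on the logarithm of the Gamma function from the standard library to avoid overflow.

```python
import math

def gamma_ratio(f):
    """Exact ratio Gamma((f+1)/2) / Gamma(f/2), computed via log-Gamma."""
    return math.exp(math.lgamma((f + 1) / 2) - math.lgamma(f / 2))

def gamma_ratio_large_f(f):
    """Large-f approximation of the same ratio."""
    return math.sqrt(f / 2)

if __name__ == "__main__":
    for f in (4, 10, 100, 1000):
        exact = gamma_ratio(f)
        approx = gamma_ratio_large_f(f)
        print(f, exact, approx, approx / exact - 1.0)
```

The relative difference printed in the last column decreases as f grows, which is the sense in which the approximation is asymptotic.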
Table A.12 Critical values of F statistic that include p = 0.80 probability

f2\f1        2       4       6       8      10      20      40      60      80     100
1       12.000  13.644  14.258  14.577  14.772  15.171  15.374  15.442  15.477  15.497
2        4.000   4.236   4.317   4.358   4.382   4.432   4.456   4.465   4.469   4.471
3        2.886   2.956   2.971   2.976   2.979   2.983   2.984   2.984   2.984   2.984
4        2.472   2.483   2.473   2.465   2.460   2.445   2.436   2.433   2.431   2.430
5        2.259   2.240   2.217   2.202   2.191   2.166   2.151   2.146   2.143   2.141
6        2.130   2.092   2.062   2.042   2.028   1.995   1.976   1.969   1.965   1.963
7        2.043   1.994   1.957   1.934   1.918   1.879   1.857   1.849   1.844   1.842
8        1.981   1.923   1.883   1.856   1.838   1.796   1.770   1.761   1.756   1.753
9        1.935   1.870   1.826   1.798   1.778   1.732   1.704   1.694   1.689   1.686
10       1.899   1.829   1.782   1.752   1.732   1.682   1.653   1.642   1.636   1.633
20       1.746   1.654   1.596   1.558   1.531   1.466   1.424   1.408   1.399   1.394
30       1.699   1.600   1.538   1.497   1.468   1.395   1.347   1.328   1.318   1.312
40       1.676   1.574   1.509   1.467   1.437   1.360   1.308   1.287   1.276   1.269
50       1.662   1.558   1.492   1.449   1.418   1.338   1.284   1.262   1.249   1.241
60       1.653   1.548   1.481   1.437   1.406   1.324   1.268   1.244   1.231   1.223
70       1.647   1.540   1.473   1.429   1.397   1.314   1.256   1.231   1.218   1.209
80       1.642   1.535   1.467   1.422   1.390   1.306   1.247   1.222   1.208   1.199
90       1.639   1.531   1.463   1.418   1.385   1.300   1.240   1.214   1.200   1.191
100      1.636   1.527   1.459   1.414   1.381   1.296   1.234   1.208   1.193   1.184
200      1.622   1.512   1.443   1.396   1.363   1.274   1.209   1.180   1.163   1.152
∞        1.609   1.497   1.426   1.379   1.344   1.252   1.182   1.150   1.130   1.117


A.7  The  Kolmogorov-Smirnov  Test 

The one-sample Kolmogorov-Smirnov statistic D_N is defined in (13.7) as

D_N = \max_x |F_N(x) - F(x)|,

where F(x) is the parent distribution, and F_N(x) the sample distribution.
The cumulative distribution of the test statistic can be approximated by

P\left(D_N < z / (\sqrt{N} + 0.12 + 0.11/\sqrt{N})\right) \simeq \Phi(z),

where

\Phi(z) = \sum_{r=-\infty}^{\infty} (-1)^r e^{-2 r^2 z^2},
Table A.13 Critical values of F statistic that include p = 0.90 probability

f2\f1        2       4       6       8      10      20      40      60      80     100
1       49.500  55.833  58.204  59.439  60.195  61.740  62.529  62.794  62.927  63.007
2        9.000   9.243   9.326   9.367   9.392   9.441   9.466   9.475   9.479   9.481
3        5.462   5.343   5.285   5.252   5.230   5.184   5.160   5.151   5.147   5.144
4        4.325   4.107   4.010   3.955   3.920   3.844   3.804   3.790   3.782   3.778
5        3.780   3.520   3.404   3.339   3.297   3.207   3.157   3.140   3.132   3.126
6        3.463   3.181   3.055   2.983   2.937   2.836   2.781   2.762   2.752   2.746
7        3.257   2.960   2.827   2.752   2.703   2.595   2.535   2.514   2.504   2.497
8        3.113   2.806   2.668   2.589   2.538   2.425   2.361   2.339   2.328   2.321
9        3.006   2.693   2.551   2.469   2.416   2.298   2.232   2.208   2.196   2.189
10       2.924   2.605   2.461   2.377   2.323   2.201   2.132   2.107   2.095   2.087
20       2.589   2.249   2.091   1.999   1.937   1.794   1.708   1.677   1.660   1.650
30       2.489   2.142   1.980   1.884   1.820   1.667   1.573   1.538   1.519   1.507
40       2.440   2.091   1.927   1.829   1.763   1.605   1.506   1.467   1.447   1.434
50       2.412   2.061   1.895   1.796   1.729   1.568   1.465   1.424   1.402   1.389
60       2.393   2.041   1.875   1.775   1.707   1.544   1.437   1.395   1.372   1.358
70       2.380   2.027   1.860   1.760   1.691   1.526   1.418   1.374   1.350   1.335
80       2.370   2.016   1.849   1.748   1.680   1.513   1.403   1.358   1.334   1.318
90       2.362   2.008   1.841   1.739   1.671   1.503   1.391   1.346   1.320   1.304
100      2.356   2.002   1.834   1.732   1.663   1.494   1.382   1.336   1.310   1.293
200      2.329   1.973   1.804   1.701   1.631   1.458   1.339   1.289   1.261   1.242
∞        2.303   1.945   1.774   1.670   1.599   1.421   1.295   1.240   1.207   1.185


and it is independent of the form of the parent distribution F(x). For large values of
N, we can use the asymptotic equation

P(D_N < z/\sqrt{N}) = \Phi(z).

Table A.25 lists the critical values of \sqrt{N} D_N for various levels of probability.
Values of the Kolmogorov-Smirnov statistic above the critical value indicate a
rejection of the null hypothesis that the data are drawn from the parent model.
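
To make the definition of the one-sample statistic concrete, the following sketch computes D_N for a sample and a parent cumulative distribution. It is a minimal example with hypothetical names; because the sample distribution F_N(x) is a step function, the maximum deviation is searched just before and just after each sorted data point.

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_N.

    sample: iterable of measurements
    cdf:    parent cumulative distribution F(x), a callable
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # F_N jumps from (i-1)/n to i/n at x; check both sides of the step
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d

if __name__ == "__main__":
    # Hypothetical sample compared with a uniform parent on [0, 1]
    data = [0.4, 0.1, 0.7]
    print(ks_statistic(data, lambda x: x))
```

The value of D_N multiplied by the square root of N is the quantity to compare with the critical values of Table A.25.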

The two-sample Kolmogorov-Smirnov statistic is

D_{NM} = \max_x |F_M(x) - G_N(x)|,

where F_M(x) and G_N(x) are the sample cumulative distributions of two independent
sets of observations of size M and N. This statistic has the same distribution as the
one-sample Kolmogorov-Smirnov statistic, with the substitution of MN/(M + N) in
place of N, in the limit of large M and N (see (13.12)).
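
The critical values of the Kolmogorov-Smirnov statistic can be approximated directly from the series for Φ(z) and the finite-N correction quoted in this section. The sketch below is an illustration under stated assumptions: the function names, the number of series terms, and the bisection bracket [0.3, 3] are choices made for this example, and the finite-N values are approximate rather than exact.

```python
import math

def kolmogorov_cdf(z, terms=100):
    """Phi(z) = sum over r of (-1)^r exp(-2 r^2 z^2), folded onto r >= 1."""
    total = 1.0  # the r = 0 term
    for r in range(1, terms + 1):
        total += 2.0 * (-1) ** r * math.exp(-2.0 * r * r * z * z)
    return total

def asymptotic_critical_z(p, tol=1e-9):
    """Solve Phi(z) = p by bisection (large-N critical value of sqrt(N) D_N)."""
    lo, hi = 0.3, 3.0  # bracket assumed wide enough for typical p
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def scaled_critical_value(n, p):
    """Approximate critical value of sqrt(N) D_N for a sample of size n."""
    z = asymptotic_critical_z(p)
    root_n = math.sqrt(n)
    return root_n * z / (root_n + 0.12 + 0.11 / root_n)

def effective_size(m, n):
    """Size replacing N in the two-sample case, MN / (M + N)."""
    return m * n / (m + n)

if __name__ == "__main__":
    for p in (0.50, 0.90, 0.95, 0.99):
        print(p, round(asymptotic_critical_z(p), 3),
              round(scaled_critical_value(100, p), 3))
```

For the two-sample test, one way to use the same functions is to pass effective_size(M, N) in place of n, which is appropriate only when both samples are large.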


Table A.14 Critical values of F statistic that include p = 0.95 probability

∞        2.996   2.372   2.099   1.938   1.831   1.571   1.394   1.318   1.273   1.243


Table A.15 Critical values of F statistic that include p = 0.99 probability

∞        4.605   3.319   2.802   2.511   2.321   1.878   1.592   1.473   1.404   1.358
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Appendix:  Numerical  Tables 
[Table A.22: Integral of Student’s function (f = 50), or probability p, as function of critical value T_crit; tabulated values not recoverable]

Table A.23 Comparison of integrals of Student’s function at different critical values.

                         Critical value
f          T = 1     T = 2     T = 3     T = 4     T = 5
1       0.500000  0.704833  0.795168  0.844042  0.874334
2       0.577351  0.816497  0.904534  0.942809  0.962251
3       0.608998  0.860674  0.942332  0.971992  0.984608
4       0.626099  0.883884  0.960058  0.983870  0.992510
5       0.636783  0.898061  0.969901  0.989677  0.995896
6       0.644083  0.907574  0.975992  0.992881  0.997548
7       0.649384  0.914381  0.980058  0.994811  0.998435
8       0.653407  0.919484  0.982929  0.996051  0.998948
9       0.656564  0.923448  0.985044  0.996890  0.999261
10      0.659107  0.926612  0.986657  0.997482  0.999463
11      0.661200  0.929196  0.987921  0.997914  0.999598
12      0.662951  0.931345  0.988934  0.998239  0.999691
13      0.664439  0.933160  0.989762  0.998488  0.999757
14      0.665718  0.934712  0.990449  0.998684  0.999806
15      0.666830  0.936055  0.991028  0.998841  0.999842
16      0.667805  0.937228  0.991521  0.998968  0.999870
17      0.668668  0.938262  0.991946  0.999073  0.999891
18      0.669435  0.939179  0.992315  0.999161  0.999908
19      0.670123  0.939998  0.992639  0.999234  0.999921
20      0.670744  0.940735  0.992925  0.999297  0.999932
21      0.671306  0.941400  0.993179  0.999351  0.999940
22      0.671817  0.942005  0.993406  0.999397  0.999948
23      0.672284  0.942556  0.993610  0.999438  0.999954
24      0.672713  0.943061  0.993795  0.999474  0.999959
25      0.673108  0.943524  0.993962  0.999505  0.999963
26      0.673473  0.943952  0.994115  0.999533  0.999967
27      0.673811  0.944348  0.994255  0.999558  0.999970
28      0.674126  0.944715  0.994383  0.999580  0.999973
29      0.674418  0.945057  0.994501  0.999600  0.999975
30      0.674692  0.945375  0.994610  0.999619  0.999977
31      0.674948  0.945673  0.994712  0.999635  0.999979
32      0.675188  0.945952  0.994806  0.999650  0.999981
33      0.675413  0.946214  0.994893  0.999664  0.999982
34      0.675626  0.946461  0.994975  0.999677  0.999983
35      0.675826  0.946693  0.995052  0.999688  0.999984
36      0.676015  0.946912  0.995123  0.999699  0.999985
37      0.676194  0.947119  0.995191  0.999709  0.999986
38      0.676364  0.947315  0.995254  0.999718  0.999987
39      0.676525  0.947501  0.995314  0.999727  0.999988
40      0.676678  0.947678  0.995370  0.999735  0.999989
41      0.676824  0.947846  0.995424  0.999742  0.999989
42      0.676963  0.948006  0.995474  0.999749  0.999990
43      0.677095  0.948158  0.995522  0.999755  0.999990
44      0.677222  0.948304  0.995568  0.999761  0.999991
45      0.677343  0.948443  0.995611  0.999767  0.999991
46      0.677458  0.948576  0.995652  0.999773  0.999992
47      0.677569  0.948703  0.995691  0.999778  0.999992
48      0.677675  0.948824  0.995729  0.999782  0.999992
49      0.677777  0.948941  0.995765  0.999787  0.999993
50      0.677875  0.949053  0.995799  0.999791  0.999993
∞       0.682690  0.954500  0.997301  0.999937  1.000000

Table A.24 Critical values of the linear correlation coefficient

        Probability p to have an absolute value of r below the critical value
f        0.50   0.60   0.70   0.80   0.90   0.95   0.99
2       0.500  0.600  0.700  0.800  0.900  0.950  0.990
3       0.404  0.492  0.585  0.687  0.805  0.878  0.959
4       0.347  0.426  0.511  0.608  0.729  0.811  0.917
5       0.309  0.380  0.459  0.551  0.669  0.754  0.875
6       0.281  0.347  0.420  0.507  0.621  0.707  0.834
7       0.260  0.321  0.390  0.472  0.582  0.666  0.798
8       0.242  0.300  0.365  0.443  0.549  0.632  0.765
9       0.228  0.282  0.344  0.419  0.521  0.602  0.735
10      0.216  0.268  0.327  0.398  0.497  0.576  0.708
20      0.152  0.189  0.231  0.284  0.360  0.423  0.537
30      0.124  0.154  0.189  0.233  0.296  0.349  0.449
40      0.107  0.133  0.164  0.202  0.257  0.304  0.393
50      0.096  0.119  0.147  0.181  0.231  0.273  0.354
60      0.087  0.109  0.134  0.165  0.211  0.250  0.325
70      0.081  0.101  0.124  0.153  0.195  0.232  0.302
80      0.076  0.094  0.116  0.143  0.183  0.217  0.283
90      0.071  0.089  0.109  0.135  0.173  0.205  0.267
100     0.068  0.084  0.104  0.128  0.164  0.195  0.254
200     0.048  0.060  0.073  0.091  0.116  0.138  0.181
300     0.039  0.049  0.060  0.074  0.095  0.113  0.148
500     0.030  0.038  0.046  0.057  0.073  0.087  0.114
1000    0.021  0.027  0.033  0.041  0.052  0.062  0.081
Table A.25 Critical values of the Kolmogorov-Smirnov statistic D_N

        Probability p to have D_N × √N below the critical value
N        0.50   0.60   0.70   0.80   0.90   0.95   0.99
1       0.750  0.800  0.850  0.900  0.950  0.975  0.995
2       0.707  0.782  0.866  0.967  1.098  1.191  1.314
3       0.753  0.819  0.891  0.978  1.102  1.226  1.436
4       0.762  0.824  0.894  0.985  1.130  1.248  1.468
5       0.765  0.827  0.902  0.999  1.139  1.260  1.495
6       0.767  0.833  0.910  1.005  1.146  1.272  1.510
7       0.772  0.838  0.914  1.009  1.154  1.279  1.523
8       0.776  0.842  0.917  1.013  1.159  1.285  1.532
9       0.779  0.844  0.920  1.017  1.162  1.290  1.540
10      0.781  0.846  0.923  1.020  1.166  1.294  1.546
15      0.788  0.855  0.932  1.030  1.177  1.308  1.565
20      0.793  0.860  0.937  1.035  1.184  1.315  1.576
25      0.796  0.863  0.941  1.039  1.188  1.320  1.583
30      0.799  0.866  0.943  1.042  1.192  1.324  1.588
35      0.801  0.868  0.946  1.045  1.194  1.327  1.591
40      0.803  0.869  0.947  1.046  1.196  1.329  1.594
45      0.804  0.871  0.949  1.048  1.198  1.331  1.596
50      0.805  0.872  0.950  1.049  1.199  1.332  1.598
60      0.807  0.874  0.952  1.051  1.201  1.335  1.601
70      0.808  0.875  0.953  1.053  1.203  1.337  1.604
80      0.810  0.877  0.955  1.054  1.205  1.338  1.605
90      0.811  0.878  0.956  1.055  1.206  1.339  1.607
100     0.811  0.879  0.957  1.056  1.207  1.340  1.608
200     0.816  0.883  0.961  1.061  1.212  1.346  1.614
300     0.818  0.885  0.964  1.063  1.214  1.348  1.617
500     0.820  0.887  0.966  1.065  1.216  1.350  1.620
1000    0.822  0.890  0.968  1.067  1.218  1.353  1.622
∞       0.828  0.895  0.973  1.073  1.224  1.358  1.628
χ² distribution, 122
mean, 125
χ²_min statistic, 149, 177
hypothesis testing, 178
probability function, 178
R statistic, 266
acceptance probability, 252
accessible state, 241
Anderson, E., 165
auxiliary distribution, 251
average

linear, 107

relative-error weighted, 113
weighted, 107

Bayes’ theorem, 10
Bayesian method for Poisson mean, 102

Bayesian statistics, 12
Beta function, 133

binary experiment, 35

binning of data, 162

Kolmogorov Smirnov test, 216
binomial distribution, 35

comparison with Gaussian and Poisson, 51
Ehrenfest chain, 246
mean, 38
probability function, 38
variance, 39
bivariate data, 203
use of χ², 209
bootstrap simulation, 230
synthetic dataset, 230
unbiased estimator, 232
candidate of MCMC, 251
Cartesian coordinates, 65
transformation to polar, 81
approximate distribution, 180
model parameters, 181, 186
reduced parameters, 184
error propagation, 55, 70
error propagation formula, 71
experiment, 1
exponential distribution, 19

cumulative distribution function, 20
probability function, 19
simulation, 79

F statistic, 131

approximations, 134
degrees of freedom, 134
distribution function, 132
mean, 134
variance, 134
tables, 282
F test, 211
multi-variable linear regression, 172
use on same dataset, 213
full conditional distribution, 258
mean, 69

multi-dimensional method, 66
variance, 70

gamma distribution, 123
Gamma function, 124, 134
asymptotic expansion, 288
Gaussian distribution, 40
confidence intervals, 94
cumulative distribution function, 44
mean, 43
probability function, 43

simulation, 45, 79
variance, 43

tables, 273

Gehrels approximation, 99

tables of upper and lower limits, 277
Gelman Rubin statistic, 260, 264
between-chain variance, 265
within-chain variance, 265
genes and genotypes, 8
Geweke Z score, 260, 263
Gibbs sampler, 258
gibbsit software, 268
goodness of fit
Cash statistic, 180
Gaussian data, 149
Poisson data, 161
Hubble constant, 158
Hubble’s law, 157
χ² distribution, 126
χ²_min statistic, 178
acceptable region, 118
confidence level, 118
F statistic, 134
linear correlation coefficient, 190
sampling distribution of variance, 131
Student’s t distribution, 141
acceptability of null hypothesis, 120
impossible event, 1
independent events, 5
independent variables, 28
two-variable data, 187
interesting and uninteresting parameters, 184
intrinsic covariance, 204
intrinsic scatter, 196
direct calculation, 196
parameter estimation, 200
intrinsic variance, 197, 204
joint distribution function, 26
approximation, 218
comparison of data with model, 216
non-parametric nature, 216
Gaussian data, 85
Poisson data, 50
linear average, 107
hypothesis testing, 190
probability function, 188
tables, 286, 308
linear regression, 150
error matrix, 153
identical errors, 151
multi-variable, 168
identical errors or no errors, 155
model sample variance, 156
parameter errors and covariance, 154
log-normal distribution, 110
logarithmic average, 109
weighted, 110

marginalization of random variables, 29, 31
marginalization  of  uninteresting  parameters, 
185 

Markov  chains,  237 
accessible  states,  241 
dependence  of  samples,  250 
Markovian  property,  238 
recurrent  and  transient  states,  240 
short  memory  property,  238 
state  of  system,  238 
communication  of  states,  243 
detailed  balance,  254 
irreducible  aperiodic  chains,  244 
limiting  probability,  243 
periodicity,  244 

recurrent  and  transient  states,  242 
stationary  distribution,  243 
time  reversible,  253 
Markovian  property,  238 
mass  distribution  function,  see  probability 
mass  function 

maximum  likelihood  method 
bivariate  data,  204 
fit  to  non-linear  functions,  160 
fit  to  two-dimensional  data,  149 
Gaussian  data,  149 
Gaussian  variable,  85 
estimate  of  mean,  86 
estimate  of  sample  variance,  87 
estimate  of  variance,  87 
other  variables,  90 
Poisson  data,  160 
Poisson  variable,  90 

MCMC,  see  Monte  Carlo  Markov  chains
mean,  21 

function  of  random  variables,  69 
linear  combination  of  variables,  55 
weighted,  89 

Bayesian  expectation  for  Poisson  mean,  102

non-uniform  errors,  88 
median,  22,  109 

insensitivity  to  outliers,  109 
Mendel,  G.,  8 
method  of  moments,  91 
Metropolis  Hastings  algorithm,  251
case  of  uniform  priors  and  proposals,  253
justification  of  method,  253
proposal  (auxiliary)  distribution,  251


mixing  properties  of  MCMC,  262 
mode,  22 

model  sample  variance,  156 
moment  generating  function,  58 
properties,  59 
Gaussian  distribution,  59 
Poisson  distribution,  60 
sum  of  Poisson  variables,  60 
sum  of  uniform  distributions,  63 
sum  of  uniform  variables,  62 
moments  of  distribution  function,  20 
Monte  Carlo,  225 

function  evaluation,  227 
integration,  226 
dart  method,  227 

multi-dimensional  integration,  227 
simulation  of  variables,  228 
Monte  Carlo  Markov  chains,  249 
acceptance  probability,  252 
candidates,  251 
prior  distribution,  251
burn-in  period,  259,  263 
convergence  tests,  259 
correlation  of  links,  261 
mixing,  262 

posterior  distribution,  250,  252 
stopping  time,  260 
thinning,  263 

multi-variable  dataset,  165 
multi-variable  linear  regression,  168 
coefficient  of  determination,  174 
design  matrix,  169 
error  matrix,  169 
F  test,  172 
Iris  data,  171 
T  test,  170 

tests  for  significance,  170 
multiple  linear  regression,  151 
best-fit  parameters,  152 
error  matrix,  153 
parameter  errors,  153 
multiplicative  errors,  109 
mutually  exclusive  events,  2 

nested  component,  211,  214 
normal  distribution,  40 
null  hypothesis,  118 

Occam’s  razor,  216 
Occam,  William  of,  216 
orthonormal  transformation,  129 
overbooking,  probability,  39 
parent  distribution,  18
parent  mean,  22 

comparison  with  sample  mean,  137 
partition  of  sample  space,  2 
Pearson,  K.,  30 

data  on  biometric  characteristics,  30

periodicity  of  Markov  chains,  244 
permutations,  36 
photon  counting  experiment,  29 
Poisson  distribution,  45 

Bayesian  expectation  of  mean,  103 
Bayesian  upper  and  lower  limits,  103
comparison  with  binomial  and  Gaussian,  51

likelihood,  49 
mean,  46 

moment  generating  function,  60 
posterior  distribution,  50 
posterior  probability,  49,  102 
S  parameter,  100 
upper  and  lower  limits,  98 
variance,  47 
Poisson  process,  48 
polar  coordinates,  65 

transformation  to  Cartesian,  81
posterior  distribution,  241 
Markov  chains,  250 
posterior  probability,  11,  13 
Poisson  mean,  102 
prior  distribution,  250 
MCMC,  251 
prior  probability,  11
probability 

Bayesian  method,  1,  4 
classical  method,  3 
empirical  method,  4 
frequentist  method,  1,  3 
fundamental  properties,  4 
probability  distribution  function,  19 
probability  mass  function,  19 
probability  of  event,  2 
proposal  distribution,  251

quantile,  93,  268 
quantile  function,  76 

exponential  distribution,  77 
uniform  distribution,  77 

Raftery  Lewis  test,  267 
random  error,  198 
random  variables,  17 


random  walk,  237,  239 
recurrence  of  states,  241 
transition  probability,  239 
recurrent  states,  247 
Rayleigh  distribution,  66 

cumulative  distribution,  80 
quantile  function,  80 
reduced  χ²,  125
rejection  of  hypothesis,  120 
rejection  region,  118 
one-sided,  119 
two-sided,  119 

relative  uncertainty  of  random  variable,  57,  72 
relative-error  weighted  average,  113 
resampled  dataset,  234 
residual  variance,  172 

sample  correlation  coefficient,  27,  28 
sample  covariance,  27,  28 
sample  distribution,  18 
sample  mean,  21 

comparison  of  two  means,  141 
comparison  with  parent  mean,  137 
sample  space,  1 
partition,  2 
sample  variance,  23 
sampling  distribution  of  mean,  137 
sampling  distribution  of  variance,  127 
degrees  of  freedom,  130 
probability  function,  130 
sequence  of  events,  36 
sequence  of  random  variables,  237 
signal-to-noise  ratio,  72 
simulation  of  number  π,  228
simulation  of  random  variables,  78 
exponential,  79 
Gaussian,  79,  229 
Monte  Carlo  methods,  228 
square  of  uniform  distribution,  79 
standard  deviation,  22 
standard  error,  22 
standard  Gaussian,  45 
stationary  distribution,  237 
statistic,  117 
statistical  error,  108,  199 
statistical  independence,  5 
necessary  conditions  for,  6 
statistical  independence  of  random  variables,  28

Stirling’s  approximation,  51 
stochastic  processes,  237 
stopping  time  of  MCMC,  260 
strong  law  of  large  numbers,  68 
Student’s  t  distribution,  137

comparison  of  two  sample  means,  142 
degrees  of  freedom,  139 
hypothesis  testing,  141 
mean,  140 

probability  function,  139 
tables,  285 

sum  of  random  variables,  62 
synthetic  dataset,  230 
systematic  error,  108,  199 
additive,  199 
multiplicative,  199 
parameter  estimation,  200 

T  test,  170 

thinning  of  MCMC,  263 
Thomson,  J.J.,  23 

analysis  of  experimental  data,  222 
discovery  of  electron,  23 
Total  probability  theorem,  10 
transition  kernel,  239 
transition  probability,  239 
triangular  distribution,  64,  67 
two-variable  dataset,  147 
bivariate  errors,  203 
independent  variable,  147,  187 
Monte  Carlo  estimates  of  errors,  230 

uncorrelated  variables,  27,  56 
uniform  distribution,  67 
probability  function,  67 
simulation,  79 


square,  69 

sum  of  two  variables,  67 
upper  and  lower  limits,  94 

Bayesian  method  for  Poisson  mean,  103 
Gaussian  variables,  95 
Gehrels  approximation,  99
Poisson  variable,  98,  99 
upper  limit  to  non  detection,  96 
Bayesian,  103 
Gaussian,  96 
Poisson,  101 

variance 

debiased,  204 
explained,  173 
intrinsic,  204 

linear  combination  of  variables,  56 
residual,  172 
weighted  mean,  90 
anti-correlated  variables,  56 
correlated  variables,  57 
function  of  random  variables,  70 
variance  of  random  variable,  22 
variance  of  sum  of  variables,  28 

weighted  logarithmic  average,  110 
weighted  mean,  89,  107 
variance,  90 

z-score,  45,  138 


